OCR (Optical Character Recognition) in PHP

Ever wanted to extract text from images using PHP? That doesn’t seem trivial, does it? But you will find it’s not that hard after reading this article, so let’s get started!

 
First, we need an OCR engine. I once wrote one in pure PHP, but it was too slow to be usable on larger images. PHP is simply not well suited to this kind of task.

 
Luckily, there is a free OCR engine, available for many platforms including Windows and Linux, that we can use as a command-line app. It’s called GOCR, and it is available from the project’s website. You can choose Windows or OS/2 binaries, or packages for Fedora, Debian, SuSE and others. It’s open source, so you can grab the source code and compile it yourself if you can’t find binaries for your platform.

 
Once we have GOCR, we need a PHP script that calls the program from the command line and collects the results. Unfortunately, GOCR does not support any of PHP’s built-in image formats (png, jpeg, gif, xbm), so we need something to convert the source image into a format GOCR can read.

Let’s go through the script line by line. First we need to include a library that converts images to PNM (a format supported by GOCR). We’ll use Ziin Image Formats:

include 'zif.php';
include 'zif_one.php';

Now we read the source PNG image and save it as PNM:

$im = imagecreatefromfile('input.png');
imagepnm($im, 'input2.pnm');
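If you don’t have ZIF at hand, the simplest member of the PNM family is easy to write by hand. The sketch below encodes a grid of gray values as a binary PGM ("P5") file: a short text header followed by one byte per pixel. Note that encodePgm() is a hypothetical helper of mine, not part of GD or ZIF, and I’m assuming the pixels have already been reduced to gray values in the 0-255 range.

```php
<?php
// Sketch: encode a grid of gray values (0-255) as a binary PGM ("P5") image.
// encodePgm() is a hypothetical helper, not part of GD or ZIF.
function encodePgm(array $rows): string
{
    $height = count($rows);
    $width  = $height > 0 ? count($rows[0]) : 0;
    if ($width === 0) {
        throw new InvalidArgumentException('Image must have at least one pixel.');
    }

    // Header: magic number, dimensions, maximum gray value.
    $data = "P5\n{$width} {$height}\n255\n";

    foreach ($rows as $row) {
        if (count($row) !== $width) {
            throw new InvalidArgumentException('All rows must have the same width.');
        }
        foreach ($row as $gray) {
            // One byte per pixel, clamped to the 0-255 range.
            $data .= chr(max(0, min(255, (int) $gray)));
        }
    }

    return $data;
}
```

With GD you could fill $rows by reading each pixel with imagecolorat() and reducing it to a single gray value, then save the result with file_put_contents('input2.pnm', encodePgm($rows)).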

Then we call the GOCR command-line app to do the hard part, the optical character recognition:

exec('gocr049 input2.pnm', $text);
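A bare exec() call fails silently: if the binary or the image can’t be found, you just get an empty array. Here is a slightly more defensive sketch. runGocr() is a hypothetical wrapper of mine; it assumes the binary name you pass in is on the PATH of the user running PHP, and it resolves the image to an absolute path so the call doesn’t depend on the web server’s working directory.

```php
<?php
// Sketch: run GOCR on a PNM file and fail loudly instead of returning an empty array.
// runGocr() is a hypothetical helper; the binary name is supplied by the caller.
function runGocr(string $binary, string $imagePath): array
{
    // Resolve to an absolute path so the result does not depend on the
    // current working directory of the PHP process.
    $absolute = realpath($imagePath);
    if ($absolute === false || !is_file($absolute)) {
        throw new RuntimeException("Image not found: {$imagePath}");
    }

    // Quote both parts and merge stderr into the captured output,
    // so error messages from the shell or from GOCR are not lost.
    $command = escapeshellarg($binary) . ' ' . escapeshellarg($absolute) . ' 2>&1';

    $lines  = [];
    $status = 0;
    exec($command, $lines, $status);

    if ($status !== 0) {
        throw new RuntimeException(
            "Command failed with exit code {$status}: " . implode("\n", $lines)
        );
    }

    return $lines;
}
```

You would call it as $text = runGocr('gocr049', 'input2.pnm'); and wrap it in try/catch to show a readable error. Keep in mind that with stderr merged in, any warnings the program prints will end up mixed into the recognized text.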

The result is stored in $text as an array, so we convert the array to a string, with the lines separated by ‘<br>’:

$text = implode('<br>', $text);

And finally, we display the result:

echo $text;
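One thing to watch when echoing OCR output into a page: the recognized text can contain characters such as < or &, which the browser would treat as markup. A minimal sketch of a safer join, assuming the output is meant for an HTML page (linesToHtml() is a name I made up for illustration):

```php
<?php
// Sketch: escape each recognized line before joining with HTML line breaks.
// linesToHtml() is a hypothetical helper, not part of PHP or GOCR.
function linesToHtml(array $lines): string
{
    $escaped = array_map(
        // ENT_SUBSTITUTE replaces invalid byte sequences instead of
        // returning an empty string for the whole line.
        static fn(string $line): string =>
            htmlspecialchars($line, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8'),
        $lines
    );

    return implode('<br>', $escaped);
}
```

Only the line contents are escaped; the separators stay real line-break tags, so the page layout is the same as before.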

Of course, the result is almost never perfect, and the quality of GOCR’s output depends on the quality of the source image. The higher the resolution and the lower the noise, the better the recognition. GOCR also has a teaching mode that we can use to improve results. Just run GOCR manually with these parameters:

gocr049 -m 130 -p ./db/ input2.pnm

To recognize images using the database you built while teaching GOCR, pass -m 2 (use database) together with the database path:

gocr049 -m 2 -p ./db/ input2.pnm

./db/ is the path to our GOCR database.
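The -m value is a bitfield. As I read the GOCR manual page, bit 2 means "use database" and bit 128 means "extend database", which is where 130 comes from. The sketch below builds both command lines from those bits. gocrDatabaseCommand() and the constant names are my own, and the teaching command is meant to be pasted into a terminal, since GOCR prompts for input in that mode.

```php
<?php
// Mode bits as I understand them from the GOCR manual page (-m option).
// The constant and function names here are hypothetical.
const GOCR_USE_DATABASE    = 2;
const GOCR_EXTEND_DATABASE = 128;

function gocrDatabaseCommand(string $binary, string $image, string $dbPath, bool $teach): string
{
    $mode = GOCR_USE_DATABASE;
    if ($teach) {
        // Teaching mode: use the database and extend it (2 + 128 = 130).
        $mode |= GOCR_EXTEND_DATABASE;
    }

    return sprintf(
        '%s -m %d -p %s %s',
        escapeshellarg($binary),
        $mode,
        escapeshellarg($dbPath),
        escapeshellarg($image)
    );
}
```

For example, gocrDatabaseCommand('gocr049', 'input2.pnm', './db/', true) yields the teaching command, and passing false yields the recognition command with -m 2.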

The whole script:

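// ZIF (Ziin Image Formats) handles the conversion to PNM
include 'zif.php';
include 'zif_one.php';
// Load the source image and save it as PNM for GOCR
$im = imagecreatefromfile('input.png');
imagepnm($im, 'input2.pnm');
// Run GOCR; each line of recognized text becomes one array element
exec('gocr049 input2.pnm', $text);
// Join the lines with HTML line breaks and print them
$text = implode('<br>', $text);
echo $text;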

The same script is also available zipped together with GOCR for Windows and a test image.

This entry was posted on Tuesday, March 6th, 2012 at 7:21 pm and is filed under PHP.

9 Responses to “OCR (Optical Character Recognition) in PHP”

  1. Nikolay Says:

    As an alternative, you may use a cloud service: a web API that lets you upload an image and sends back the OCR’ed data. This service http://ocrsdk.com/producttour/programming-languages/ works with PHP and even has PHP code samples on GitHub: https://github.com/abbyysdk/ocrsdk.com/tree/master/PHP

    It’s not free but it’s still worth trying (it has a 90-day free trial).

  2. Sumit Madan Says:

    hi,
    This code is not working. It prints an empty array.
    exec('gocr input2.pnm', $text);
    print_r($text);

    When I ran “gocr fullpath/input2.pnm” in the terminal, it worked fine.

  3. elango Says:

    Where can I get these two files?

    include 'zif.php';

    include 'zif_one.php';

  4. admin Says:

    These are from Ziin Image Formats, as mentioned in the article.

  5. admin Says:

    Maybe something wrong with the paths. It works for me.

  6. Willy Says:

    Hi… any chance you could tell me which settings in ZIF or ImageMagick I could use to make a good quality monochromatic picture?

  7. admin Says:

    A loop like this gives the best results:
    imagefilter($im, IMG_FILTER_GRAYSCALE);
    for ($y = 0; $y < imagesy($im); $y++) {
        for ($x = 0; $x < imagesx($im); $x++) {
            // On a truecolor image the low byte is the gray level after the filter
            $gray = imagecolorat($im, $x, $y) & 0xFF;
            imagesetpixel($im, $x, $y, $gray > 127 ? 0xFFFFFF : 0x000000);
        }
    }

  8. Issac Says:

    Where can I get these two files?

    include 'zif.php';

    include 'zif_one.php';

  9. admin Says:

    As mentioned in the article, these files are part of ZIF: Ziin Image Formats. This is not a free library, but it’s quite cheap.
