
OCR (Optical Character Recognition) in PHP

When you want to get text from images or solve captchas
Ever wanted to extract text from images using PHP? That doesn't seem trivial, does it? But you will find it's not that hard after reading this article, so let's get started!

First, we need an OCR engine. I once wrote one in pure PHP, but it was so slow that it was unusable on larger images. PHP is just not the right tool for this kind of task.

Luckily, there is a free OCR engine available for many platforms, including Windows and Linux, that we can use as a command-line app. It's called GOCR. You can choose Windows or OS/2 binaries, or packages for Fedora, Debian, SuSE and others. It's open source, so you can grab the source code and compile it yourself if you can't find binaries for your platform.

Once we have GOCR, we need a PHP script that calls the program from the command line and collects the results. Unfortunately, GOCR does not support any of the image formats built into PHP (png, jpeg, gif, xbm), so we need to convert the source image into something GOCR can read: the PNM format. I wrote a simple function for that.
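If you don't have that helper at hand, here is a minimal sketch of what such a converter could look like. It is not the PHP_PNM.php file used in this article: the function names `sketch_encode_pgm` and `sketch_gd_to_pgm` are made up for illustration, and it assumes the GD extension is loaded and that a binary grayscale PGM (one of the PNM variants) is good enough as input for GOCR.

```php
<?php
// Hypothetical helpers, NOT the PHP_PNM.php mentioned in the article.

/**
 * Encode grayscale pixels (values 0-255, row by row) as a binary PGM ("P5") string.
 */
function sketch_encode_pgm(int $width, int $height, array $gray): string
{
    if (count($gray) !== $width * $height) {
        throw new InvalidArgumentException('Pixel count does not match image size');
    }
    // Header: magic number, dimensions, maximum gray value.
    $header = "P5\n{$width} {$height}\n255\n";
    // Body: one byte per pixel.
    return $header . implode('', array_map('chr', $gray));
}

/**
 * Convert a GD image to grayscale and save it as a PGM file.
 */
function sketch_gd_to_pgm($im, string $path): bool
{
    $width  = imagesx($im);
    $height = imagesy($im);
    $gray   = [];
    for ($y = 0; $y < $height; $y++) {
        for ($x = 0; $x < $width; $x++) {
            // imagecolorsforindex() handles both palette and truecolor images.
            $c = imagecolorsforindex($im, imagecolorat($im, $x, $y));
            // Standard luma weights for RGB to gray conversion.
            $gray[] = (int) round(0.299 * $c['red'] + 0.587 * $c['green'] + 0.114 * $c['blue']);
        }
    }
    return file_put_contents($path, sketch_encode_pgm($width, $height, $gray)) !== false;
}
```

The encoding step is kept separate from the GD loop so it can be checked without any image at all. For large images, building the pixel array in memory like this will be slow, so treat it as a starting point only.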



Let's build our script.
First we include a function which can convert images to PNM format:
include 'PHP_PNM.php';
Now we read the source PNG image and save it as PNM:
$im = imagecreatefrompng('input.png');
imagepnm($im, 'input2.pnm');
Then we call the GOCR command-line app to do the hard part, optical character recognition:
exec('gocr049 input2.pnm', $text);
The result is stored in $text as an array, one element per line of output, so we join the array into a single string with the lines separated by "\n":
$text = implode("\n", $text);
And finally, we display the result:
echo $text;
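The steps above assume that GOCR runs without problems. As a rough sketch, the exec() and implode() steps could be wrapped so that a missing binary or a failed run is noticed. The function name `sketch_run_command` is hypothetical, and the example assumes exec() is not disabled in your PHP configuration.

```php
<?php
// Hypothetical wrapper around the exec() call; the name is illustrative only.

function sketch_run_command(string $binary, string $inputPath): string
{
    $lines    = [];  // start empty: exec() appends to an existing array
    $exitCode = 0;
    // Escape both parts so unusual file names cannot break the command line.
    $command = escapeshellcmd($binary) . ' ' . escapeshellarg($inputPath);
    exec($command, $lines, $exitCode);
    if ($exitCode !== 0) {
        throw new RuntimeException("Command failed with exit code {$exitCode}: {$command}");
    }
    // Join the output lines into one string, as in the article.
    return implode("\n", $lines);
}

// Usage, assuming gocr049 is on the PATH and input2.pnm exists:
// echo sketch_run_command('gocr049', 'input2.pnm');
```

Starting from an empty array matters if you call this in a loop over several images, because exec() adds new lines to whatever the array already holds.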
Of course, the result is almost never perfect, and the quality of GOCR's output depends on the quality of the source image. The higher the resolution and the less noise, the better the recognition. GOCR has a teaching mode that we can use to improve results. Just run GOCR manually with these parameters:
gocr049 -m 130 -p ./db/ input2.pnm
And when you want to recognize images using the database you created while teaching GOCR, use (-m 2 tells GOCR to consult the database):
gocr049 -m 2 -p ./db/ input2.pnm
./db/ is the path to our GOCR database.
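To use the database from the PHP script rather than by hand, the command line has to carry the same options. Below is a small sketch of a helper that assembles it. The name `sketch_gocr_command` is hypothetical, and the use of `-m 2` together with `-p` follows the GOCR manual as far as I understand it, so check it against the version you have installed.

```php
<?php
// Hypothetical helper: assembles a GOCR command line with an optional database.

function sketch_gocr_command(string $inputPath, ?string $dbPath = null, string $binary = 'gocr049'): string
{
    $parts = [escapeshellcmd($binary)];
    if ($dbPath !== null) {
        // GOCR expects the database path to end with a slash.
        $dbPath  = rtrim($dbPath, '/') . '/';
        $parts[] = '-m 2';                          // assumed: mode 2 = use the database
        $parts[] = '-p ' . escapeshellarg($dbPath); // database location
    }
    $parts[] = escapeshellarg($inputPath);
    return implode(' ', $parts);
}

// Usage: pass the result to exec() in place of the fixed command string.
// exec(sketch_gocr_command('input2.pnm', './db/'), $text);
```

The teaching run itself (-m 130) asks questions interactively, so it is better done manually in a terminal than through exec().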


The whole script:
include 'PHP_PNM.php';
$im = imagecreatefrompng('input.png');
imagepnm($im, 'input2.pnm');
exec('gocr049 input2.pnm', $text);
$text = implode("\n", $text);
echo $text;

Or the same thing zipped with GOCR for Windows and a test image:
