Ever wanted to extract text from images using PHP? That doesn’t seem trivial, does it? But you will find it’s not that hard after reading this article, so let’s get started!
First, we need an OCR engine. I once wrote one in pure PHP, but it was so slow that it wasn’t usable on larger images. PHP is just not well suited to such tasks.
Luckily, there is a free OCR engine available for many platforms, including Windows and Linux, that we can use as a command-line app. It’s called GOCR, and you can download it from the project’s website. You can choose Windows or OS/2 binaries, or packages for Fedora, Debian, SuSE and others. It’s open source, so you can grab the source code and compile it yourself if you can’t find binaries for your platform.
Once we have GOCR, we need a PHP script to call the program from the command line and collect the results. Unfortunately, GOCR does not support any of PHP’s built-in image formats (PNG, JPEG, GIF, XBM), so we need something to convert the source image into a format suitable for GOCR.
Let’s analyze the script line by line. First we need to include a library to convert images to PNM (a format supported by GOCR). We’ll use Ziin Image Formats:
include 'zif.php';
include 'zif_one.php';
Now we read the source PNG image and save it as PNM:
$im = imagecreatefromfile('input.png');
imagepnm($im, 'input2.pnm');
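If you’d rather not pull in an extra library for this step, the binary PPM flavour of PNM is simple enough to write by hand: a short text header followed by raw RGB bytes. Here is a minimal sketch, assuming the GD extension is available. The names ppmEncode and gdToPpm are made up for this example and are not part of Ziin Image Formats.

```php
<?php
// Hypothetical helpers, not part of Ziin Image Formats.

// Encode pixels as a binary PPM ("P6") string.
// $rgbAt($x, $y) must return [red, green, blue], each in the range 0-255.
function ppmEncode(int $width, int $height, callable $rgbAt): string
{
    // Header: magic number, dimensions, maximum channel value.
    $data = "P6\n{$width} {$height}\n255\n";
    for ($y = 0; $y < $height; $y++) {
        for ($x = 0; $x < $width; $x++) {
            [$r, $g, $b] = $rgbAt($x, $y);
            $data .= chr($r) . chr($g) . chr($b);
        }
    }
    return $data;
}

// Convert a GD image (palette or truecolor) to a PPM string.
function gdToPpm($im): string
{
    return ppmEncode(imagesx($im), imagesy($im), function ($x, $y) use ($im) {
        // imagecolorsforindex() decodes the value returned by imagecolorat().
        $c = imagecolorsforindex($im, imagecolorat($im, $x, $y));
        return [$c['red'], $c['green'], $c['blue']];
    });
}
```

With these in place, something like file_put_contents('input2.pnm', gdToPpm(imagecreatefrompng('input.png'))); should produce a file GOCR can read. Building the string pixel by pixel is slow for large images, so treat it as a starting point rather than a finished converter.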
Then we call the GOCR command-line app to do the hard part, optical character recognition:
exec('gocr049 input2.pnm', $text);
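The call above silently gives us an empty array if GOCR is missing or fails. exec() can also hand back the exit status of the command, so a slightly more defensive version might look like the sketch below. runCommand is a made-up helper name, and the sketch assumes GOCR signals failure through a non-zero exit status, as command-line tools conventionally do.

```php
<?php
// Hypothetical helper: run a command and return its output lines,
// throwing if the command exits with a non-zero status.
function runCommand(string $command): array
{
    $lines = [];
    $status = 0;
    // The third argument receives the exit status of the command.
    exec($command, $lines, $status);
    if ($status !== 0) {
        throw new RuntimeException("Command failed with exit status $status: $command");
    }
    return $lines;
}

// Usage (requires GOCR to be installed):
// $text = runCommand('gocr049 input2.pnm');
```

A status of 127 from the shell usually means the program was not found, which is a handy hint when the binary is named differently on your system.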
The result is stored in $text as an array, one element per line of output, so we convert the array to a string with the lines separated by ‘<br>’:
$text = implode('<br>', $text);
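One thing to watch out for: recognized text can contain characters such as < or &, which the browser would treat as markup. If the output ends up in an HTML page, it is safer to escape each line before joining. A minimal sketch, with linesToHtml being a made-up helper name:

```php
<?php
// Hypothetical helper: turn recognized lines into HTML that is safe to echo.
function linesToHtml(array $lines): string
{
    $escaped = array_map(function ($line) {
        // ENT_SUBSTITUTE replaces invalid byte sequences instead of
        // returning an empty string for the whole line.
        return htmlspecialchars($line, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    }, $lines);
    return implode('<br>', $escaped);
}
```

The '<br>' separators are added after escaping, so they stay real line breaks while everything GOCR produced is shown literally.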
And finally, we display the result:
echo $text;
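Of course, the result is almost never perfect, and the quality of GOCR output depends on the quality of the source image. The higher the resolution and the lower the noise, the better the recognition. GOCR supports a teaching mode that we can use to improve results: it prompts you for characters it cannot identify and stores your answers in a database. Just run GOCR manually with these parameters (mode 130 combines 2, use the database, with 128, extend the database):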
gocr049 -m 130 -p ./db/ input2.pnm
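And when you want to recognize images using the database you created while teaching GOCR, use mode 2, which tells GOCR to consult the database without extending it:
gocr049 -m 2 -p ./db/ input2.pnm
./db/ is the path to our GOCR database (note the trailing slash).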
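If you want to call these variants from PHP as well, it helps to build the command line in one place. The sketch below takes the mode and database path from the commands above as optional arguments. buildGocrCommand is a made-up helper name, and gocr049 is simply the binary name used in this article, so adjust it to match your installation.

```php
<?php
// Hypothetical helper: assemble a gocr command line.
// $mode and $dbPath correspond to the -m and -p options shown above.
function buildGocrCommand(
    string $image,
    ?int $mode = null,
    ?string $dbPath = null,
    string $binary = 'gocr049'
): string {
    $parts = [$binary];
    if ($mode !== null) {
        $parts[] = '-m ' . $mode;
    }
    if ($dbPath !== null) {
        // The commands above pass the path with a trailing slash; make sure it is there.
        $parts[] = '-p ' . escapeshellarg(rtrim($dbPath, '/') . '/');
    }
    // escapeshellarg() keeps unusual file names from being interpreted by the shell.
    $parts[] = escapeshellarg($image);
    return implode(' ', $parts);
}

// Usage (requires GOCR to be installed):
// exec(buildGocrCommand('input2.pnm', null, './db/'), $text);
```

Keeping the quoting inside one function means the rest of the script never has to worry about spaces or special characters in file names.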
The whole script:
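// Ziin Image Formats: adds PNM support
include 'zif.php';
include 'zif_one.php';
// Read the source PNG and save it as PNM
$im = imagecreatefromfile('input.png');
imagepnm($im, 'input2.pnm');
// Run GOCR; each line of its output becomes one array element
exec('gocr049 input2.pnm', $text);
// Join the lines for HTML output and display them
$text = implode('<br>', $text);
echo $text;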
The same script is also available as a zip archive, bundled with GOCR for Windows and a test image.