Building a letter classifier in PHP with Tesseract OCR and PHP ML

Published by Matthew Daly at 10th May 2018 10:50 pm

PHP isn't the first language that springs to mind when it comes to machine learning. However, it is practical to use PHP for machine learning purposes. In this tutorial I'll show you how to build a pipeline for classifying letters.

The brief

Before I was a web dev, I was a clerical worker for a FTSE-100 insurance company, doing a lot of work that can nowadays be automated away, if you know how. When the company received a letter or other communication from a client, it would be sent off to be scanned. Once scanned, a human would have to look at it to classify it (e.g. was it a complaint, a request for information, a request for a quote, or something else) and assign it to a policy number. Let's imagine we've been asked to build a proof of concept for automating this process. This is a good example of a real-world problem that machine learning can help with.

As this is a proof of concept we aren't looking to build a web app for this - for simplicity's sake this will be a command-line application. Unlike emails, letters don't come in an easily machine-readable format, so we will be receiving them as PDF files (since they would have been scanned in, this is a reasonable assumption). Feel free to mock up your own example letters using your own classifications, but I will be classifying letters into four groups:

Complaints - letters expressing dissatisfaction
Information requests - letters requesting general information
Surrender quotes - letters requesting a surrender quote
Surrender forms - letters requesting surrender forms

Our application will therefore take in a PDF file at one end, and perform the following actions on it:

Convert the PDF file to a PNG file
Use OCR (optical character recognition) to convert the letter to plain text
Strip out unwanted whitespace
Extract any visible policy number from the text
Use a machine learning library to classify the letter, having taught it using prior examples

Sound interesting? Let's get started...

Introducing pipelines

As our application will be carrying out a series of discrete steps on our data, it makes sense to use the pipeline pattern for this project. Fortunately, the PHP League have produced an excellent package implementing this. We can therefore create a single class for each step in the process and have it handle that step in isolation.
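
To make the idea concrete, here is a minimal sketch of the pipeline pattern using plain callables. It is only an illustration of the concept: runPipeline() and the three toy stages are hypothetical and are not part of league/pipeline, which we'll use for the real thing.

```php
<?php

// Hypothetical helper, for illustration only: runs a payload through
// a list of callables, feeding the output of each into the next.
function runPipeline(array $stages, $payload)
{
    return array_reduce($stages, function ($carry, callable $stage) {
        return $stage($carry);
    }, $payload);
}

// Three toy stages, each with a single job.
$stages = [
    function ($text) {
        return trim($text);
    },
    function ($text) {
        return strtoupper($text);
    },
    function ($text) {
        return ['content' => $text, 'length' => strlen($text)];
    },
];

$result = runPipeline($stages, "  hello world \n");
print_r($result);
```

Each stage only needs to know about its own input and output, which is what lets us write and test the steps separately.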

We'll also use the Symfony Console component to implement our command-line application. For our machine learning library we will be using PHP ML, which requires PHP 7.1 or greater. For OCR, we will be using Tesseract, so you will need to install the underlying Tesseract OCR library, as well as support for your language. On Ubuntu you can install these as follows:

$ sudo apt-get install tesseract-ocr tesseract-ocr-eng

This assumes you are using English, but you should be able to find packages to support many other languages. Finally, we need ImageMagick, along with the Imagick PHP extension that the code below relies on, in order to convert PDF files to PNGs.
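
If you want to confirm everything is in place before going further, something like the following sketch may help. The missingPrerequisites() function is a hypothetical helper of my own, and it assumes a Unix-like shell where command -v is available and shell_exec() isn't disabled.

```php
<?php

// Hypothetical helper: returns a list of the prerequisites that
// appear to be missing on this machine.
function missingPrerequisites(): array
{
    $missing = [];

    // PHP ML requires PHP 7.1 or greater.
    if (version_compare(PHP_VERSION, '7.1.0', '<')) {
        $missing[] = 'PHP 7.1 or greater';
    }

    // The Imagick class is provided by the imagick extension.
    if (!extension_loaded('imagick')) {
        $missing[] = 'imagick PHP extension';
    }

    // Assumption: a POSIX shell, so "command -v" can locate the binary.
    $path = shell_exec('command -v tesseract 2>/dev/null');
    if (trim((string) $path) === '') {
        $missing[] = 'tesseract binary';
    }

    return $missing;
}

$missing = missingPrerequisites();
echo $missing ? 'Missing: '.implode(', ', $missing).PHP_EOL : 'All good'.PHP_EOL;
```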

Your composer.json should look something like this:

{
    "name": "matthewbdaly/letter-classifier",
    "description": "Demo of classifying letters in PHP",
    "type": "project",
    "require": {
        "league/pipeline": "^0.3.0",
        "thiagoalessio/tesseract_ocr": "^2.2",
        "php-ai/php-ml": "^0.6.2",
        "symfony/console": "^4.0"
    },
    "require-dev": {
        "phpspec/phpspec": "^4.3",
        "psy/psysh": "^0.8.17"
    },
    "autoload": {
        "psr-4": {
            "Matthewbdaly\\LetterClassifier\\": "src/"
        }
    },
    "license": "MIT",
    "authors": [
        {
            "name": "Matthew Daly",
            "email": "matthewbdaly@gmail.com"
        }
    ]
}

Next, let's write the outline of our command-line client. We'll load a single class for our processor command. Save this as app:

#!/usr/bin/env php
<?php

require __DIR__.'/vendor/autoload.php';

use Symfony\Component\Console\Application;
use Matthewbdaly\LetterClassifier\Commands\Processor;

$application = new Application();
$application->add(new Processor());
$application->run();

Next, we create our command. Save this as src/Commands/Processor.php:

<?php

namespace Matthewbdaly\LetterClassifier\Commands;

use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Console\Input\InputArgument;
use League\Pipeline\Pipeline;
use Matthewbdaly\LetterClassifier\Stages\ConvertPdfToPng;
use Matthewbdaly\LetterClassifier\Stages\ReadFile;
use Matthewbdaly\LetterClassifier\Stages\Classify;
use Matthewbdaly\LetterClassifier\Stages\StripTabs;
use Matthewbdaly\LetterClassifier\Stages\GetPolicyNumber;

class Processor extends Command
{
    protected function configure()
    {
        $this->setName('process')
            ->setDescription('Processes a file')
            ->setHelp('This command processes a file')
            ->addArgument('file', InputArgument::REQUIRED, 'File to process');
    }

    protected function execute(InputInterface $input, OutputInterface $output)
    {
        $file = $input->getArgument('file');
        $pipeline = (new Pipeline)
            ->pipe(new ConvertPdfToPng)
            ->pipe(new ReadFile)
            ->pipe(new StripTabs)
            ->pipe(new GetPolicyNumber)
            ->pipe(new Classify);
        $response = $pipeline->process($file);
        $output->writeln("Classification is ".$response['classification']);
        $output->writeln("Policy number is ".$response['policy']);
    }
}

Note how our command accepts the file name as an argument. We then build our pipeline from a series of classes, each of which has a single role, and pass the file name through it. Finally, we retrieve our response and output it.

With that done, we can move on to implementing our first step. Save this as src/Stages/ConvertPdfToPng.php:

<?php

namespace Matthewbdaly\LetterClassifier\Stages;

use Imagick;

class ConvertPdfToPng
{
    public function __invoke($file)
    {
        // Create a temporary file and look up its path on disk
        $tmp = tmpfile();
        $uri = stream_get_meta_data($tmp)['uri'];
        $img = new Imagick();
        // Set the resolution before reading, so the PDF is rasterised at 300 DPI
        $img->setResolution(300, 300);
        $img->readImage($file);
        $img->setImageDepth(8);
        $img->setImageFormat('png');
        $img->writeImage($uri);
        // Return the handle rather than the path, so the file stays alive
        return $tmp;
    }
}

This stage fetches the file passed through, and converts it into a PNG file, stores it as a temporary file, and returns a reference to it. The output of this stage will then form the input of the next. This is how pipelines work, and it makes it easy to break up a complex process into multiple steps that can be reused in different places, facilitating easier code reuse and making your code simpler to understand and reason about.
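
One detail worth knowing is why the stage returns the file handle rather than just the path. A file created with tmpfile() is removed automatically once its handle is closed, so the handle has to be kept alive until the next stage has read the file. The following sketch demonstrates that behaviour on its own, with a placeholder string standing in for real image data:

```php
<?php

// Create a temporary file and look up where it lives on disk.
$tmp = tmpfile();
$uri = stream_get_meta_data($tmp)['uri'];

// Write some placeholder content to it by path, much as Imagick would.
file_put_contents($uri, 'pretend this is PNG data');

$existsWhileOpen = file_exists($uri);
$contents = file_get_contents($uri);

// Closing the handle deletes the temporary file.
fclose($tmp);
clearstatcache();
$existsAfterClose = file_exists($uri);

var_dump($existsWhileOpen, $contents, $existsAfterClose);
```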

Our next step carries out optical character recognition. Save this as src/Stages/ReadFile.php:

<?php

namespace Matthewbdaly\LetterClassifier\Stages;

use thiagoalessio\TesseractOCR\TesseractOCR;

class ReadFile
{
    public function __invoke($file)
    {
        $uri = stream_get_meta_data($file)['uri'];
        $ocr = new TesseractOCR($uri);
        return $ocr->lang('eng')->run();
    }
}

As you can see, this accepts the handle to the temporary file as an argument, looks up its path, and runs Tesseract on it to retrieve the text. Note that we specify a language of eng - if you want to use a language other than English, you should specify it here.

At this point, we should have some usable text, but there may be unknown amounts of whitespace, so our next step uses a regex to collapse each run of whitespace into a single space and trim the ends. Save this as src/Stages/StripTabs.php:

<?php

namespace Matthewbdaly\LetterClassifier\Stages;

class StripTabs
{
    public function __invoke($content)
    {
        return trim(preg_replace('/\s+/', ' ', $content));
    }
}
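
To see what that regex does in practice, here it is applied to a made-up snippet of the sort of text OCR might hand back, with stray tabs, newlines and repeated spaces:

```php
<?php

// Made-up example of messy OCR output.
$raw = "Dear Sir,\n\n\tPlease   send me a quote.\r\n  Policy 12345678  \n";

// Collapse every run of whitespace to one space, then trim the ends.
$clean = trim(preg_replace('/\s+/', ' ', $raw));

echo $clean.PHP_EOL;
// Dear Sir, Please send me a quote. Policy 12345678
```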

With our whitespace issue sorted out, we now need to retrieve the policy number the communication should be filed under. These are generally regular alphanumeric patterns, so regexes are a suitable way of matching them. As this is a proof of concept, we'll assume a very simple pattern for policy numbers in that they will consist of between seven and nine digits. Save this as src/Stages/GetPolicyNumber.php:

<?php

namespace Matthewbdaly\LetterClassifier\Stages;

class GetPolicyNumber
{
    public function __invoke($content)
    {
        $matches = [];
        $policyNumber = '';
        preg_match('/\d{7,9}/', $content, $matches);
        if (count($matches)) {
            $policyNumber = $matches[0];
        }
        return [
            'content' => $content,
            'policy' => $policyNumber
        ];
    }
}
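
One thing to be aware of with this simple pattern is that it will also match the first nine digits of any longer run of digits, such as a phone number. If that turns out to be a problem, one possible refinement is to add word boundaries, as in the sketch below. The findPolicyNumber() function and the sample strings are hypothetical, and this is not what the stage above does.

```php
<?php

// Hypothetical variant: only accept runs of 7 to 9 digits that are
// not part of a longer word or number.
function findPolicyNumber(string $content): string
{
    if (preg_match('/\b\d{7,9}\b/', $content, $matches)) {
        return $matches[0];
    }
    return '';
}

// The original pattern picks up part of an 11-digit phone number...
$loose = [];
preg_match('/\d{7,9}/', 'Call me on 01234567890', $loose);

// ...whereas the stricter pattern ignores it.
$strict = findPolicyNumber('Call me on 01234567890');
$found = findPolicyNumber('Re: policy 1234567, surrender quote');

var_dump($loose[0], $strict, $found);
```

The trade-off is that the stricter pattern would miss a policy number with letters directly attached to it, so which version is appropriate depends on what your real policy numbers look like.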

Finally, we're onto the really tough part - using machine learning to classify the letters. Save this as src/Stages/Classify.php:

<?php

namespace Matthewbdaly\LetterClassifier\Stages;

use Phpml\Dataset\CsvDataset;
use Phpml\Dataset\ArrayDataset;
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WordTokenizer;
use Phpml\CrossValidation\StratifiedRandomSplit;
use Phpml\FeatureExtraction\TfIdfTransformer;
use Phpml\Metric\Accuracy;
use Phpml\Classification\SVC;
use Phpml\SupportVectorMachine\Kernel;

class Classify
{
    protected $dataset;

    protected $classifier;

    protected $vectorizer;

    protected $tfIdfTransformer;

    public function __construct()
    {
        // Load the training data: one feature column (the text), then the label
        $this->dataset = new CsvDataset('data/letters.csv', 1);
        $this->vectorizer = new TokenCountVectorizer(new WordTokenizer());
        $this->tfIdfTransformer = new TfIdfTransformer();
        $samples = [];
        foreach ($this->dataset->getSamples() as $sample) {
            $samples[] = $sample[0];
        }
        // Convert the text to token counts, then weight them with TF-IDF
        $this->vectorizer->fit($samples);
        $this->vectorizer->transform($samples);
        $this->tfIdfTransformer->fit($samples);
        $this->tfIdfTransformer->transform($samples);
        $dataset = new ArrayDataset($samples, $this->dataset->getTargets());
        // Hold back 10% of the data for testing and train on the rest
        $randomSplit = new StratifiedRandomSplit($dataset, 0.1);
        $this->classifier = new SVC(Kernel::RBF, 10000);
        $this->classifier->train($randomSplit->getTrainSamples(), $randomSplit->getTrainLabels());
        $predictedLabels = $this->classifier->predict($randomSplit->getTestSamples());
        echo 'Accuracy: '.Accuracy::score($randomSplit->getTestLabels(), $predictedLabels).PHP_EOL;
    }

    public function __invoke(array $message)
    {
        // Apply the same transformations to the new letter before predicting
        $newSample = [$message['content']];
        $this->vectorizer->transform($newSample);
        $this->tfIdfTransformer->transform($newSample);
        $message['classification'] = $this->classifier->predict($newSample)[0];
        return $message;
    }
}

We train up our model in the constructor, and classify each incoming message in __invoke(). The steps are as follows:

First, we use the token count vectorizer to convert our samples to vectors of token counts - each word is mapped to a position in the vector, and the value at that position records how often the word occurs.
Next, we use TfIdfTransformer to weight those counts by how important each word is to a document relative to the rest of the dataset.
Then we instantiate our classifier, train it on a random subset of our data, and use the 10% that was held back to measure its accuracy.
Finally, when a message comes through the pipeline, we apply the same transformations to it, pass it to our now-trained classifier and see what it tells us.
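
If the first two steps sound a bit abstract, the following sketch works through them by hand on a made-up corpus of three tiny letters. It uses a common textbook formulation of TF-IDF (the count multiplied by the log of the total number of documents over the number of documents containing the word), which I'm assuming is close to, but not necessarily identical to, what PHP ML computes internally.

```php
<?php

// Hypothetical three-letter corpus, for illustration only.
$documents = [
    'please send a surrender quote',
    'please send a surrender form',
    'i wish to complain about delays',
];

// Step 1: token counts. Give each distinct word a column index...
$vocabulary = [];
$tokenised = [];
foreach ($documents as $document) {
    $tokens = preg_split('/\s+/', strtolower($document), -1, PREG_SPLIT_NO_EMPTY);
    $tokenised[] = $tokens;
    foreach ($tokens as $token) {
        if (!isset($vocabulary[$token])) {
            $vocabulary[$token] = count($vocabulary);
        }
    }
}

// ...then count how often each word occurs in each document.
$counts = [];
foreach ($tokenised as $tokens) {
    $row = array_fill(0, count($vocabulary), 0);
    foreach ($tokens as $token) {
        $row[$vocabulary[$token]]++;
    }
    $counts[] = $row;
}

// Step 2: TF-IDF. Count how many documents contain each word...
$documentFrequency = array_fill(0, count($vocabulary), 0);
foreach ($counts as $row) {
    foreach ($row as $index => $count) {
        if ($count > 0) {
            $documentFrequency[$index]++;
        }
    }
}

// ...and scale each count down the more documents the word appears in.
$tfIdf = [];
foreach ($counts as $row) {
    $weighted = [];
    foreach ($row as $index => $count) {
        $idf = log(count($documents) / $documentFrequency[$index]);
        $weighted[] = $count * $idf;
    }
    $tfIdf[] = $weighted;
}

printf("quote: %.3f, please: %.3f\n", $tfIdf[0][$vocabulary['quote']], $tfIdf[0][$vocabulary['please']]);
```

In the first letter, "quote" ends up with a higher weight than "please", because "please" also appears in the second letter and so tells us less about which kind of letter we're looking at.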

Now, bear in mind I don't have a background in machine learning and this is the first time I've done anything with it, so I can't tell you much more than that - if you want to know more I suggest you investigate on your own. In figuring this out I was helped a great deal by this article on SitePoint, so you might want to start there.

The finished application is on GitHub, and the repository includes a CSV file of training data, as well as the examples folder, which contains some example PDF files. You can run it as follows:

$ php app process examples/Quote.pdf

I found that once I had trained it up using the CSV data from the repository, it was around 70-80% accurate, which isn't bad at all considering the comparatively small size of the dataset. If this were genuinely being used in production, there would be an extremely large dataset of historical scanned letters to use for training purposes, so it wouldn't be unreasonable to expect much better results under those circumstances.
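
For reference, the accuracy figure is simply the proportion of held-back test letters whose predicted label matches the real one. A minimal sketch of that calculation, using a hypothetical accuracy() function and made-up labels rather than PHP ML's own Accuracy class, might look like this:

```php
<?php

// Hypothetical helper: fraction of predictions that match the real labels.
function accuracy(array $actual, array $predicted): float
{
    if (count($actual) === 0 || count($actual) !== count($predicted)) {
        throw new InvalidArgumentException('Label lists must be non-empty and the same length');
    }

    $correct = 0;
    foreach ($actual as $index => $label) {
        if ($label === $predicted[$index]) {
            $correct++;
        }
    }

    return $correct / count($actual);
}

// Made-up labels: the third letter has been misclassified.
$actual = ['complaint', 'quote', 'form', 'information', 'quote'];
$predicted = ['complaint', 'quote', 'quote', 'information', 'quote'];

$score = accuracy($actual, $predicted);
echo 'Accuracy: '.$score.PHP_EOL;
```

With only a handful of test letters, a single misclassification moves the score a long way, which is worth bearing in mind when reading figures from a small dataset.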

Exercises for the reader

If you want to develop this concept further, here are some ideas:

We should be able to correct the model when it's wrong. Add a separate command to train the model by passing through a file and specifying how it should be categorised, e.g. php app train File.pdf quote.
Try processing information from different sources. For instance, you could replace the first two stages with a stage that pulls all unread emails from a specified mailbox using PHP's IMAP support, or one that fetches data from the Twitter API. Or you could have a telephony service such as Twilio set up as your voicemail, automatically transcribe the messages, then pass the text to PHP ML for classification.
If you're multilingual, you could try adding a step to sort letters by language and have separate models for classifying in each language.
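
As a starting point for the first exercise, here is a rough sketch of how a correction could be recorded. Since the Classify stage retrains from the CSV file every time it runs, appending a new labelled row should be enough for it to be picked up on the next run. The appendTrainingSample() function is hypothetical, as are the heading names, and I'm assuming the file has a heading row followed by one text column and one label column.

```php
<?php

// Hypothetical helper: appends one labelled sample to the training CSV.
function appendTrainingSample(string $csvPath, string $text, string $label): void
{
    $handle = fopen($csvPath, 'a');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $csvPath for appending");
    }
    // Delimiter, enclosure and escape are passed explicitly for clarity.
    fputcsv($handle, [$text, $label], ',', '"', '\\');
    fclose($handle);
}

// Demonstrate against a throwaway file rather than the real dataset.
$csvPath = tempnam(sys_get_temp_dir(), 'letters');
file_put_contents($csvPath, "text,label\n");

appendTrainingSample($csvPath, 'Please send me a surrender quote', 'quote');

$rows = array_map(function ($line) {
    return str_getcsv($line, ',', '"', '\\');
}, file($csvPath, FILE_IGNORE_NEW_LINES));

print_r($rows);
unlink($csvPath);
```

In a real train command, the text would come from running the PDF through the first three stages of the pipeline rather than being typed in by hand.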

Summary

It's actually quite a sobering thought that already it's possible to use techniques like these to produce tools that replace people in various jobs, and as the tooling matures more and more tasks involving classification are going to become amenable to automation using machine learning.

This was my first experience with machine learning and it's been very interesting for me to solve a real-world problem with it. I hope it gives you some ideas about how you could use it too.
