Matthew Daly's Blog

I'm a web developer in Norfolk. This is my blog...

23rd June 2018 1:03 pm

Forcing SSL in Codeigniter

I haven’t started a new CodeIgniter project since 2014, and don’t intend to, but on occasion I’ve been asked to do maintenance work on legacy CodeIgniter projects. This week I was asked to help out with a situation where a CodeIgniter site was being migrated to HTTPS and there were issues resulting from the migration.

Back in 2012, when working on my first solo project, I’d built a website using CodeIgniter that used HTTPS, but also needed to support an affiliate marketing system that did not support it. Certain pages had to force HTTP and others had to force HTTPS, so I’d used the hook system to enforce this. This kind of requirement is unlikely to recur now that HTTPS is becoming more prevalent, but sometimes it may be easier to enforce HTTPS at application level than in the web server configuration or using htaccess. It’s relatively straightforward to do that in CodeIgniter.

The first step is to create the hook. Save this as application/hooks/ssl.php:

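<?php
function force_ssl()
{
    $CI =& get_instance();
    // redirect() comes from the URL helper, so make sure it is loaded
    $CI->load->helper('url');
    // Rewrite the base URL so generated links and the redirect use HTTPS
    $CI->config->config['base_url'] = str_replace('http://', 'https://', $CI->config->config['base_url']);
    if ($_SERVER['SERVER_PORT'] != 443) {
        redirect($CI->uri->uri_string());
    }
}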

Next, we register the hook. Update application/config/hooks.php as follows:

<?php if ( ! defined('BASEPATH')) exit('No direct script access allowed');
/*
| -------------------------------------------------------------------------
| Hooks
| -------------------------------------------------------------------------
| This file lets you define "hooks" to extend CI without hacking the core
| files. Please see the user guide for info:
|
| http://codeigniter.com/user_guide/general/hooks.html
|
*/
$hook['post_controller_constructor'][] = array(
'function' => 'force_ssl',
'filename' => 'ssl.php',
'filepath' => 'hooks'
);
/* End of file hooks.php */
/* Location: ./application/config/hooks.php */

This tells CodeIgniter that it should look in the application/hooks directory for a file called ssl.php, and call the function force_ssl once the controller has been constructed.

Finally, we enable hooks. Update application/config/config.php:

$config['enable_hooks'] = TRUE;

If you only want to force SSL in production, not development, you may want to amend the ssl.php file to only perform the redirect in non-development environments, perhaps by using an environment variable via DotEnv.
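As a rough sketch of that idea, the decision can be pulled out into a small helper that the hook calls before redirecting. The function name should_force_ssl() and the APP_ENV variable are my own assumptions for illustration, not part of CodeIgniter - you could equally pass in the ENVIRONMENT constant that CodeIgniter defines in index.php.

```php
<?php
/**
 * Hypothetical helper: decide whether a request should be redirected to HTTPS.
 *
 * @param string $environment e.g. 'development', 'testing' or 'production'
 * @param array  $server      normally $_SERVER
 */
function should_force_ssl(string $environment, array $server): bool
{
    // Never redirect in development
    if ($environment === 'development') {
        return false;
    }
    // Some servers set HTTPS to 'off' (or an empty string) for plain HTTP
    $https = isset($server['HTTPS'])
        && $server['HTTPS'] !== ''
        && strtolower((string) $server['HTTPS']) !== 'off';
    $port = isset($server['SERVER_PORT']) ? (int) $server['SERVER_PORT'] : 80;

    return !$https && $port !== 443;
}

// Usage inside the hook (APP_ENV is an assumed variable name):
// $env = getenv('APP_ENV') ?: 'production';
// if (should_force_ssl($env, $_SERVER)) {
//     redirect($CI->uri->uri_string());
// }
```

Keeping the check in a plain function like this also makes it easy to test without spinning up the framework.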

3rd June 2018 4:30 pm

Logging to the ELK Stack With Laravel

Logging to text files is the simplest and most common logging setup for web apps, and it works fine for relatively small and simple applications. However, it does have some downsides:

  • It’s difficult to make the log files accessible - normally users have to SSH in to read them.
  • The tools used to filter and analyse log files have a fairly high technical barrier to entry - grep and sed are not exactly easy for non-programmers to pick up, so business information can be hard to get.
  • It’s hard to visually identify trends in the data.
  • Log files don’t let you know immediately when something urgent happens.
  • You can’t access logs for different applications through the same interface.

For rare, urgent issues where you need to be informed immediately they occur, it’s straightforward to log to an instant messaging solution such as Slack or Hipchat. However, these aren’t easily searchable, and can only be used for the most important errors (otherwise, there’s a risk that important data will be lost in the noise). There are third-party services that allow you to search and filter your logs, but they can be prohibitively expensive.

The ELK stack has recently gained a lot of attention as a sophisticated solution for logging application data. It consists of:

  • Logstash for processing log data
  • Elasticsearch as a searchable storage backend
  • Kibana as a web interface

By making the log data available using a powerful web interface, you can easily expose it to non-technical users. Kibana also comes with powerful tools to aggregate and filter the data. In addition, you can run your own instance, giving you a greater degree of control (as well as possibly being more cost-effective) compared to using a third-party service.

In this post I’ll show you how to configure a Laravel application to log to an instance of the ELK stack. Fortunately, Laravel uses the popular Monolog logging library by default, which is relatively easy to get to work with the ELK stack. First, we need to install support for the GELF logging format:

$ composer require graylog2/gelf-php

Then, we create a custom logger class:

<?php
namespace App\Logging;
use Monolog\Logger;
use Monolog\Handler\GelfHandler;
use Gelf\Publisher;
use Gelf\Transport\UdpTransport;
class GelfLogger
{
/**
* Create a custom Monolog instance.
*
* @param array $config
* @return \Monolog\Logger
*/
public function __invoke(array $config)
{
$handler = new GelfHandler(new Publisher(new UdpTransport($config['host'], $config['port'])));
return new Logger('main', [$handler]);
}
}

Finally, we configure our application to use this as our custom driver and specify the host and port in config/logging.php:

'custom' => [
'driver' => 'custom',
'via' => App\Logging\GelfLogger::class,
'host' => '127.0.0.1',
'port' => 12201,
],

You can then set up whatever logging channels you need for your application, and specify whatever log level you feel is appropriate.

Please note that this requires at least Laravel 5.6 - this file doesn’t exist in Laravel 5.5 and earlier, so you may have more work on your hands to integrate it with older versions.
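For instance, a stack channel can send each message to both a local file and the GELF channel. This is only a sketch of config/logging.php, based on the defaults that ship with Laravel 5.6 - pairing the single channel with custom is my assumption, so adjust the list to suit your application.

```php
<?php
// config/logging.php (fragment) - channel names other than 'custom' follow Laravel's defaults
return [
    'default' => env('LOG_CHANNEL', 'stack'),

    'channels' => [
        'stack' => [
            'driver' => 'stack',
            // Every message goes to the local file and to Logstash via GELF
            'channels' => ['single', 'custom'],
        ],
        'single' => [
            'driver' => 'single',
            'path' => storage_path('logs/laravel.log'),
            'level' => 'debug',
        ],
        'custom' => [
            'driver' => 'custom',
            'via' => App\Logging\GelfLogger::class,
            'host' => '127.0.0.1',
            'port' => 12201,
        ],
    ],
];
```

With LOG_CHANNEL left at stack, you keep a local copy of the logs to fall back on if the ELK instance is unreachable.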

If you already have an instance of the ELK stack set up on a remote server that’s already set up to accept input as GELF, then you should be able to point it at that and you’ll be ready to go. If you just want to try it out, I’ve been using a Docker-based project that makes it straightforward to run the whole stack locally. However, you will need to amend logstash/pipeline/logstash.conf as follows to allow it to accept log data:

input {
tcp {
port => 5000
}
gelf {
port => 12201
type => gelf
codec => "json"
}
}
## Add your filters / logstash plugins configuration here
output {
elasticsearch {
hosts => "elasticsearch:9200"
}
}

Then you can start it up using the instructions in the repository and it should be ready to go. Next, run the following command from Tinker:

Log::info('Just testing');

If you then access the web interface, you should be able to find that log message without any difficulty.

Now, this only covers the Laravel application logs. You may well want to pass other logs through to Logstash, such as Apache, Nginx or MySQL logs, and a quick Google should be sufficient to find ideas on how you might log for these services. Creating visualisations with Kibana is a huge subject, and the existing documentation covers that quite well, so if you’re interested in learning more about that I’d recommend reading the documentation and having a play with the dashboard.

13th May 2018 2:55 pm

Full-text Search With Mariadb

Recently I had the occasion to check out MariaDB’s implementation of full-text search. As it’s a relatively recent arrival in MySQL and MariaDB, it doesn’t seem to get all that much attention. In this post I’ll show you how to use it, with a few Laravel-specific pointers. We’ll be using the default User model in a new Laravel installation, which has columns for name and email.

Our first task is to create the fulltext index, which is necessary to perform the query. Run the following command:

ALTER TABLE users ADD FULLTEXT (name, email);

As you can see, we can specify multiple columns in our table to index.

If you’re using Laravel, you’ll want to create the following migration for this:

<?php
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Schema;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Database\Migrations\Migration;
class AddFulltextIndexForUsers extends Migration
{
/**
* Run the migrations.
*
* @return void
*/
public function up()
{
DB::statement('ALTER TABLE users ADD FULLTEXT(name, email)');
}
/**
* Reverse the migrations.
*
* @return void
*/
public function down()
{
DB::statement('ALTER TABLE users DROP INDEX IF EXISTS name');
}
}

Note that the index is named after the first field passed to it, so when we drop it we refer to it as name. Then, to actually query the index, you should run a command something like this:

SELECT * FROM users WHERE MATCH(name, email) AGAINST ('jeff' IN NATURAL LANGUAGE MODE);

Note that NATURAL LANGUAGE MODE is actually the default, so you can leave it off if you wish. We also have to specify the columns to match against.

If you’re using Laravel, you may want to create a reusable local scope for it:

public function scopeSearch($query, $search)
{
if (!$search) {
return $query;
}
return $query->whereRaw('MATCH(name, email) AGAINST (?)', [$search]);
}

Then you can call it as follows:

User::search('jeff')->get();

I personally have noticed that queries using MATCH seem to be far more performant, with response times between five and ten times shorter than a similar query using LIKE. However, this observation isn’t very scientific (plus, we are talking about queries that still run in a fraction of a second). Still, if you’re doing a particularly expensive query that currently uses a LIKE statement, it’s possible you may get better results by switching to a MATCH statement. Full-text search probably isn’t all that useful in this context - it’s only once we’re talking about longer text, such as blog posts, that some of the advantages, like support for stopwords, come into play.

From what I’ve seen this implementation of full-text search is a lot simpler than in PostgreSQL, which has ups and downs. On the one hand, it’s a lot easier to implement, but conversely it’s less useful - there’s no obvious way to perform a full-text search against joined tables. However, it does seem to be superior to using a LIKE statement, so it’s probably a good fit for smaller sites where something like Elasticsearch would be overkill.

10th May 2018 11:50 pm

Building a Letter Classifier in PHP With Tesseract OCR and PHP ML

PHP isn’t the first language that springs to mind when it comes to machine learning. However, it is practical to use PHP for machine learning purposes. In this tutorial I’ll show you how to build a pipeline for classifying letters.

The brief

Before I was a web dev, I was a clerical worker for a FTSE 100 insurance company, doing a lot of work that nowadays is possible to automate away, if you know how. When they received a letter or other communication from a client, it would be sent to be scanned on. Once scanned, a human would have to look at it to classify it (for example, as a complaint, a request for information, a request for a quote, or something else), as well as assign it to a policy number. Let’s imagine we’ve been asked to build a proof of concept for automating this process. This is a good example of a real-world problem that machine learning can help with.

As this is a proof of concept we aren’t looking to build a web app for this - for simplicity’s sake this will be a command-line application. Unlike emails, letters don’t come in an easily machine-readable format, so we will be receiving them as PDF files (since they would have been scanned on, this is a reasonable assumption). Feel free to mock up your own example letters using your own classifications, but I will be classifying letters into four groups:

  • Complaints - letters expressing dissatisfaction
  • Information requests - letters requesting general information
  • Surrender quotes - letters requesting a surrender quote
  • Surrender forms - letters requesting surrender forms

Our application will therefore take in a PDF file at one end, and perform the following actions on it:

  • Convert the PDF file to a PNG file
  • Use OCR (optical character recognition) to convert the letter to plain text
  • Strip out unwanted whitespace
  • Extract any visible policy number from the text
  • Use a machine learning library to classify the letter, having taught it using prior examples

Sound interesting? Let’s get started…

Introducing pipelines

As our application will be carrying out a series of discrete steps on our data, it makes sense to use the pipeline pattern for this project. Fortunately, the PHP League have produced an excellent package implementing this. We can therefore create a single class for each step in the process and have it handle that in isolation.

We’ll also use the Symfony Console component to implement our command-line application. For our machine learning library we will be using PHP ML, which requires PHP 7.1 or greater. For OCR, we will be using Tesseract, so you will need to install the underlying Tesseract OCR library, as well as support for your language. On Ubuntu you can install these as follows:

$ sudo apt-get install tesseract-ocr tesseract-ocr-eng

This assumes you are using English, but you should be able to find packages to support many other languages. Finally, we need ImageMagick to be installed in order to convert PDF files to PNGs.

Your composer.json should look something like this:

{
"name": "matthewbdaly/letter-classifier",
"description": "Demo of classifying letters in PHP",
"type": "project",
"require": {
"league/pipeline": "^0.3.0",
"thiagoalessio/tesseract_ocr": "^2.2",
"php-ai/php-ml": "^0.6.2",
"symfony/console": "^4.0"
},
"require-dev": {
"phpspec/phpspec": "^4.3",
"psy/psysh": "^0.8.17"
},
"autoload": {
"psr-4": {
"Matthewbdaly\\LetterClassifier\\": "src/"
}
},
"license": "MIT",
"authors": [
{
"name": "Matthew Daly",
"email": "matthewbdaly@gmail.com"
}
]
}

Next, let’s write the outline of our command-line client. We’ll load a single class for our processor command. Save this as app:

#!/usr/bin/env php
<?php
require __DIR__.'/vendor/autoload.php';
use Symfony\Component\Console\Application;
use Matthewbdaly\LetterClassifier\Commands\Processor;
$application = new Application();
$application->add(new Processor());
$application->run();

Next, we create our command. Save this as src/Commands/Processor.php:

<?php
namespace Matthewbdaly\LetterClassifier\Commands;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
use Symfony\Component\Console\Input\InputArgument;
use League\Pipeline\Pipeline;
use Matthewbdaly\LetterClassifier\Stages\ConvertPdfToPng;
use Matthewbdaly\LetterClassifier\Stages\ReadFile;
use Matthewbdaly\LetterClassifier\Stages\Classify;
use Matthewbdaly\LetterClassifier\Stages\StripTabs;
use Matthewbdaly\LetterClassifier\Stages\GetPolicyNumber;
class Processor extends Command
{
protected function configure()
{
$this->setName('process')
->setDescription('Processes a file')
->setHelp('This command processes a file')
->addArgument('file', InputArgument::REQUIRED, 'File to process');
}
protected function execute(InputInterface $input, OutputInterface $output)
{
$file = $input->getArgument('file');
$pipeline = (new Pipeline)
->pipe(new ConvertPdfToPng)
->pipe(new ReadFile)
->pipe(new StripTabs)
->pipe(new GetPolicyNumber)
->pipe(new Classify);
$response = $pipeline->process($file);
$output->writeln("Classification is ".$response['classification']);
$output->writeln("Policy number is ".$response['policy']);
}
}

Note how our command accepts the file name as an argument. We then instantiate our pipeline and pass the file through a series of stages, each of which has a single role. Finally, we retrieve our response and output it.

With that done, we can move on to implementing our first step. Save this as src/Stages/ConvertPdfToPng.php:

<?php
namespace Matthewbdaly\LetterClassifier\Stages;
use Imagick;
class ConvertPdfToPng
{
public function __invoke($file)
{
$tmp = tmpfile();
$uri = stream_get_meta_data($tmp)['uri'];
$img = new Imagick();
$img->setResolution(300, 300);
$img->readImage($file);
$img->setImageDepth(8);
$img->setImageFormat('png');
$img->writeImage($uri);
return $tmp;
}
}

This stage takes the file passed through, converts it into a PNG file, stores it as a temporary file, and returns a handle to it. The output of this stage will then form the input of the next. This is how pipelines work, and it makes it easy to break up a complex process into multiple steps that can be reused in different places, making your code simpler to understand and reason about.
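To make the idea concrete, here’s a minimal sketch of the pattern in plain PHP. The run_pipeline() function and the three closures are made up for illustration, and this is not how league/pipeline is implemented internally - it just shows the output of each stage becoming the input of the next.

```php
<?php
/**
 * Minimal illustration of the pipeline idea: each stage is a callable,
 * and the return value of one stage becomes the argument of the next.
 */
function run_pipeline(array $stages, $payload)
{
    return array_reduce($stages, function ($carry, callable $stage) {
        return $stage($carry);
    }, $payload);
}

$stages = [
    // Stage 1: tidy up the raw input
    function (string $text) { return trim($text); },
    // Stage 2: normalise the case
    function (string $text) { return strtoupper($text); },
    // Stage 3: wrap the result in an array, much as our later stages do
    function (string $text) { return ['content' => $text, 'length' => strlen($text)]; },
];

$result = run_pipeline($stages, "  hello world \n");
// $result is ['content' => 'HELLO WORLD', 'length' => 11]
```

Because every stage is just a callable with one argument, you can reorder, swap or reuse stages without touching the others.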

Our next step carries out optical character recognition. Save this as src/Stages/ReadFile.php:

<?php
namespace Matthewbdaly\LetterClassifier\Stages;
use thiagoalessio\TesseractOCR\TesseractOCR;
class ReadFile
{
public function __invoke($file)
{
$uri = stream_get_meta_data($file)['uri'];
$ocr = new TesseractOCR($uri);
return $ocr->lang('eng')->run();
}
}

As you can see, this accepts the handle to the temporary file as an argument, and runs Tesseract on it to retrieve the text. Note that we specify a language of eng - if you want to use a language other than English, you should specify it here.

At this point, we should have some usable text, but there may be unknown amounts of whitespace, so our next step uses a regex to collapse each run of whitespace into a single space. Save this as src/Stages/StripTabs.php:

<?php
namespace Matthewbdaly\LetterClassifier\Stages;
class StripTabs
{
public function __invoke($content)
{
return trim(preg_replace('/\s+/', ' ', $content));
}
}
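As a quick illustration of what that regex does, here it is applied to a made-up string of the sort OCR tends to produce (the sample text is my own, not taken from the example letters):

```php
<?php
// Raw OCR output often contains tabs, runs of spaces and blank lines
$raw = "  Dear Sir,\n\n\tPlease send me a   surrender quote.\r\n";

// Collapse every run of whitespace to one space, then trim both ends
$clean = trim(preg_replace('/\s+/', ' ', $raw));
// $clean is 'Dear Sir, Please send me a surrender quote.'
```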

With our whitespace issue sorted out, we now need to retrieve the policy number the communication should be filed under. These are generally regular alphanumeric patterns, so regexes are a suitable way of matching them. As this is a proof of concept, we’ll assume a very simple pattern for policy numbers in that they will consist of between seven and nine digits. Save this as src/Stages/GetPolicyNumber.php:

<?php
namespace Matthewbdaly\LetterClassifier\Stages;
class GetPolicyNumber
{
public function __invoke($content)
{
$matches = [];
$policyNumber = '';
preg_match('/\d{7,9}/', $content, $matches);
if (count($matches)) {
$policyNumber = $matches[0];
}
return [
'content' => $content,
'policy' => $policyNumber
];
}
}
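It’s worth seeing how this simple pattern behaves on a few made-up inputs. Note that a longer run of digits, such as a phone number, will also match. The stricter pattern with word boundaries below is a hypothetical variant of my own, not something the application uses.

```php
<?php
// The pattern from the stage above: the first run of 7 to 9 digits
$pattern = '/\d{7,9}/';

preg_match($pattern, 'Re: policy 12345678, surrender quote', $m1);
// $m1[0] is '12345678'

// A longer number also matches - we get its first 9 digits
preg_match($pattern, 'Call me on 01234567890', $m2);
// $m2[0] is '012345678'

// Hypothetical stricter variant: word boundaries reject longer digit runs
$strict = '/\b\d{7,9}\b/';
$found = preg_match($strict, 'Call me on 01234567890 about policy 7654321', $m3);
// $found is 1 and $m3[0] is '7654321'
```

For a real system you’d want a pattern that reflects the insurer’s actual policy number format, which would cut down on false matches considerably.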

Finally, we’re onto the really tough part - using machine learning to classify the letters. Save this as src/Stages/Classify.php:

<?php
namespace Matthewbdaly\LetterClassifier\Stages;
use Phpml\Dataset\CsvDataset;
use Phpml\Dataset\ArrayDataset;
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WordTokenizer;
use Phpml\CrossValidation\StratifiedRandomSplit;
use Phpml\FeatureExtraction\TfIdfTransformer;
use Phpml\Metric\Accuracy;
use Phpml\Classification\SVC;
use Phpml\SupportVectorMachine\Kernel;
class Classify
{
protected $dataset;
protected $classifier;
protected $vectorizer;
protected $tfIdfTransformer;
public function __construct()
{
$this->dataset = new CsvDataset('data/letters.csv', 1);
$this->vectorizer = new TokenCountVectorizer(new WordTokenizer());
$this->tfIdfTransformer = new TfIdfTransformer();
$samples = [];
foreach ($this->dataset->getSamples() as $sample) {
$samples[] = $sample[0];
}
$this->vectorizer->fit($samples);
$this->vectorizer->transform($samples);
$this->tfIdfTransformer->fit($samples);
$this->tfIdfTransformer->transform($samples);
$dataset = new ArrayDataset($samples, $this->dataset->getTargets());
$randomSplit = new StratifiedRandomSplit($dataset, 0.1);
$this->classifier = new SVC(Kernel::RBF, 10000);
$this->classifier->train($randomSplit->getTrainSamples(), $randomSplit->getTrainLabels());
$predictedLabels = $this->classifier->predict($randomSplit->getTestSamples());
echo 'Accuracy: '.Accuracy::score($randomSplit->getTestLabels(), $predictedLabels);
}
public function __invoke(array $message)
{
$newSample = [$message['content']];
$this->vectorizer->transform($newSample);
$this->tfIdfTransformer->transform($newSample);
$message['classification'] = $this->classifier->predict($newSample)[0];
return $message;
}
}

In our constructor, we train up our model by passing our sample data through the following steps:

  • First, we use the token count vectorizer to convert our samples to a vector of token counts - replacing every word with a number and keeping track of how often that word occurs.
  • Next, we use TfIdfTransformer to get statistics about how important a word is in a document.
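  • Then we instantiate our classifier and train it on a random subset of our data, holding back 10% to measure its accuracy.
  • Finally, when the stage is invoked, we transform the incoming message in the same way and pass it to our now-trained classifier to see what it tells us.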
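If the first two steps sound a bit abstract, this hand-rolled sketch in plain PHP shows roughly what they amount to. It’s illustrative only: the function names are mine, the tokenising regex is an assumption, and I’ve assumed the common weighting idf = log(N / df), so PHP ML’s own classes may differ in detail.

```php
<?php
/**
 * Count tokens per document. Returns the vocabulary (word => column)
 * and one row of counts per document.
 */
function token_counts(array $documents): array
{
    $vocabulary = [];
    $tokenised = [];
    foreach ($documents as $document) {
        // Treat any run of two or more word characters as a token
        preg_match_all('/\w\w+/u', strtolower($document), $matches);
        $tokenised[] = $matches[0];
        foreach ($matches[0] as $token) {
            if (!isset($vocabulary[$token])) {
                $vocabulary[$token] = count($vocabulary); // next free column
            }
        }
    }
    $rows = [];
    foreach ($tokenised as $tokens) {
        $row = array_fill(0, count($vocabulary), 0);
        foreach ($tokens as $token) {
            $row[$vocabulary[$token]]++;
        }
        $rows[] = $row;
    }
    return [$vocabulary, $rows];
}

/**
 * Weight each count by how rare the word is across all documents.
 */
function tf_idf(array $rows): array
{
    $total = count($rows);
    $weighted = [];
    foreach ($rows as $row) {
        $out = [];
        foreach ($row as $column => $count) {
            // Document frequency: number of documents containing this word
            $df = count(array_filter(array_column($rows, $column)));
            $out[] = $count * log($total / $df);
        }
        $weighted[] = $out;
    }
    return $weighted;
}

[$vocabulary, $counts] = token_counts([
    'Please send a surrender quote',
    'Please send surrender forms',
]);
$weights = tf_idf($counts);
```

Words that appear in every letter, such as "please", end up with a weight of zero, while "quote" and "forms" stand out - which is exactly the signal the classifier needs to tell the two kinds of letter apart.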

Now, bear in mind I don’t have a background in machine learning and this is the first time I’ve done anything with machine learning, so I can’t tell you much more than that - if you want to know more I suggest you investigate on your own. In figuring this out I was helped a great deal by this article on Sitepoint, so you might want to start there.

The finished application is on GitHub, and the repository includes a CSV file of training data, as well as the examples folder, which contains some example PDF files. You can run it as follows:

$ php app process examples/Quote.pdf

I found that once I had trained it up using the CSV data from the repository, it was around 70-80% accurate, which isn’t bad at all considering the comparatively small size of the dataset. If this were genuinely being used in production, there would be an extremely large dataset of historical scanned letters to use for training purposes, so it wouldn’t be unreasonable to expect much better results under those circumstances.

Exercises for the reader

If you want to develop this concept further, here are some ideas:

  • We should be able to correct the model when it’s wrong. Add a separate command to train the model by passing through a file and specifying how it should be categorised, eg php app train File.pdf quote.
  • Try processing information from different sources. For instance, you could replace the first two stages with a stage that pulls all unread emails from a specified mailbox using PHP’s IMAP support, or one that fetches data from the Twitter API. Or you could have a telephony service such as Twilio set up as your voicemail, automatically transcribe the messages, then pass the text to PHP ML for classification.
  • If you’re multilingual, you could try adding a step to sort letters by language and have separate models for classifying in each language.

Summary

It’s actually quite a sobering thought that already it’s possible to use techniques like these to produce tools that replace people in various jobs, and as the tooling matures more and more tasks involving classification are going to become amenable to automation using machine learning.

This was my first experience with machine learning and it’s been very interesting for me to solve a real-world problem with it. I hope it gives you some ideas about how you could use it too.

29th April 2018 8:59 pm

Console Applications With the Symfony Console Component

Recently I’ve had the occasion to add a series of console commands to a legacy application. This can be made straightforward by using the Symfony console component. In this post I’ll demonstrate how to write a simple console command for clearing a cache folder.

The first step is to install the Console component:

$ composer require symfony/console

Then we write the main script for the application. I usually save mine as console - note that we don’t want to have to type out a file extension, so instead we use the shebang:

#!/usr/bin/env php
<?php
require __DIR__.'/vendor/autoload.php';
use Symfony\Component\Console\Application;
define('CONSOLE_ROOT', __DIR__);
$app = new Application();
$app->run();

In this case, I’ve defined CONSOLE_ROOT as the directory that contains the console script - that way, the commands can use it to refer to the application root.

We can then run our console application as follows:

$ php console
Console Tool
Usage:
command [options] [arguments]
Options:
-h, --help Display this help message
-q, --quiet Do not output any message
-V, --version Display this application version
--ansi Force ANSI output
--no-ansi Disable ANSI output
-n, --no-interaction Do not ask any interactive question
-v|vv|vvv, --verbose Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
Available commands:
help Displays help for a command
list Lists commands

This displays the available commands, but you’ll note that there are none except for help and list. We’ll remedy that. First, we’ll register a command:

$app->add(new App\Console\ClearCacheCommand);

This has to be done in console, after we create $app, but before we run it.

Don’t forget to update the autoload section of your composer.json to register the namespace:

"autoload": {
"psr-4": {
"App\\Console\\": "src/Console/"
}
},

Then create the class for that command. This class must extend Symfony\Component\Console\Command\Command, and must have two methods:

  • configure()
  • execute()

In addition, the execute() method must accept two arguments, an instance of Symfony\Component\Console\Input\InputInterface, and an instance of Symfony\Component\Console\Output\OutputInterface. These are used to retrieve input and display output.

Let’s write our command:

<?php
namespace App\Console;
use Symfony\Component\Console\Command\Command;
use Symfony\Component\Console\Input\InputInterface;
use Symfony\Component\Console\Output\OutputInterface;
class ClearCacheCommand extends Command
{
protected function configure()
{
$this->setName('cache:clear')
->setDescription('Clears the cache')
->setHelp('This command clears the application cache');
}
protected function execute(InputInterface $input, OutputInterface $output)
{
$dir = CONSOLE_ROOT.DIRECTORY_SEPARATOR.'cache';
$this->deleteTree($dir);
$output->writeln('Cache cleared');
}
private function deleteTree($dir)
{
$files = array_diff(scandir($dir), array('.','..'));
foreach ($files as $file) {
(is_dir("$dir/$file")) ? $this->deleteTree("$dir/$file") : unlink("$dir/$file");
}
return rmdir($dir);
}
}

As you can see, in the configure() method, we set the name, description and help text for the command.

The execute() method is where the actual work is done. In this case, we have some code that needs to be called recursively, so we have to pull it out into a private method. Once that’s done we use $output->writeln() to write a line to the output.
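One thing to be aware of is that deleteTree() as written also removes the cache directory itself. If you’d rather empty the directory but leave it in place, something along these lines should do it. This is a hypothetical alternative using PHP’s SPL iterators rather than recursion, and the function name clear_directory() is my own.

```php
<?php
/**
 * Hypothetical alternative to deleteTree(): empty a directory but keep
 * the directory itself.
 */
function clear_directory(string $dir): void
{
    if (!is_dir($dir)) {
        return; // nothing to clear
    }
    $items = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::CHILD_FIRST // visit children before their parent directory
    );
    foreach ($items as $item) {
        // Real directories are removed with rmdir(); files and symlinks with unlink()
        if ($item->isDir() && !$item->isLink()) {
            rmdir($item->getPathname());
        } else {
            unlink($item->getPathname());
        }
    }
}
```

In the command you would then call clear_directory($dir) from execute() in place of $this->deleteTree($dir).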

Now, if we run our console task, we should see our new command:

$ php console
Console Tool
Usage:
command [options] [arguments]
Options:
-h, --help Display this help message
-q, --quiet Do not output any message
-V, --version Display this application version
--ansi Force ANSI output
--no-ansi Disable ANSI output
-n, --no-interaction Do not ask any interactive question
-v|vv|vvv, --verbose Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
Available commands:
help Displays help for a command
list Lists commands
cache
cache:clear Clears the cache

And we can see it in action too:

$ php console cache:clear
Cache cleared

For commands that need to accept additional arguments, you can define them in the configure() method (you’ll also need to import Symfony\Component\Console\Input\InputArgument):

$this->addArgument('file', InputArgument::REQUIRED, 'Which file do you want to delete?');

Then, you can access it in the execute() method using InputInterface:

$file = $input->getArgument('file');

This tutorial is just skimming the surface of what you can do with the Symfony Console component - indeed, many other console interfaces, such as Laravel’s Artisan, are built on top of it. If you have a legacy application built in a framework that lacks any sort of console interface, such as CodeIgniter, then you can quite quickly produce basic console commands for working with that application. The documentation is very good, and with a little work you can soon have something up and running.

About me

I'm a web and mobile app developer based in Norfolk. My skillset includes Python, PHP and Javascript, and I have extensive experience working with CodeIgniter, Laravel, Django, Phonegap and Angular.js.