Matthew Daly's Blog

I'm a web developer in Norfolk. This is my blog...

29th January 2012 7:52 pm

So You REALLY Don't Know Regular Expressions?

Ever since I started my new job, I’ve noticed a curious phenomenon. I work with two wonderfully gifted programmers who both know PHP much better than I do, and I learn something new from them all the time. However, neither one of them really knows or uses regular expressions.

Now, as I learned Perl before I learned PHP, naturally I learned regular expressions quite early on in that process. In Perl, regular expressions are a huge part of the language - you simply can’t avoid learning them to some extent, because they’re used extensively throughout it.

Apparently I’m not the only one to notice this. Here’s a quote I found on Stack Exchange:

In earlier phases of my career (ie. pre-PHP), I was a Perl guru, and one major aspect of Perl gurudom is mastery of regular expressions.

On my current team, I’m literally the only one of us who reaches for regex before other (usually nastier) tools. Seems like to the rest of the team they’re pure magic. They’ll wheel over to my desk and ask for a regex that takes me literally ten seconds to put together, and then be blown away when it works. I don’t know–I’ve worked with them so long, it’s just natural at this point.

In the absence of regex-fluency, you’re left with combinations of flow-control statements wrapping strstr and strpos statements, which gets ugly and hard to run in your head. I’d much rather craft one elegant regex than thirty lines of plodding string searching.

While I would hesitate to call myself a Perl guru (at best I would call myself intermediate with Perl), I would say I know enough about regular expressions that I can generally get useful work done with them.

Take the following example in Perl (edited somewhat as it didn’t play nice with TinyMCE):

$fruit = "apple,banana,cherry";
print $fruit."\n";
@fruit = split(/,/,$fruit);
foreach(@fruit)
{
    print $_."\n";
}

This produces the following output:

apple,banana,cherry
apple
banana
cherry

Now, this code should be fairly easy to understand, even if you don’t really know Perl. $fruit is a string containing “apple,banana,cherry”. The split() function takes two arguments: a regular expression matching the character(s) that separate the parts of the string you want to put into an array, and the string you want to split. The result is assigned to the array @fruit, which consists of three strings, “apple”, “banana” and “cherry”.

In PHP, you can do pretty much the same thing, using the explode() function:

<?php
$fruit = "apple,banana,cherry";
echo $fruit."\n";
$fruitArray = explode(",",$fruit);
foreach($fruitArray as $fruitArrayItem)
{
    echo $fruitArrayItem."\n";
}
?>

The output is the same:

apple,banana,cherry
apple
banana
cherry

As you can see, they work in pretty much the same way here. Both return basically the same output, and the syntax for using the appropriate functions for splitting the strings is virtually identical.

However, it’s once things get a bit more difficult that it becomes obvious how much more powerful regular expressions are. Say you’re dealing with a string that’s similar to that above, but may use different characters to separate the elements. For instance, say you’ve obtained the data that you want to pass through into an array from a text file and it’s somewhat inconsistent - perhaps the information you want is separated by differing amounts and types of whitespace, or different characters. The explode() function simply won’t handle that (at least, not without a lot of pain). But with Perl’s split() function, that’s no problem. Here’s how you might deal with input that had different types and quantities of whitespace as a separator:

@fruit = split(/\s+/,$fruit);

Yes, it’s that simple! The \s metacharacter matches any type of whitespace, and the + quantifier means it will match one or more times. Now you can very easily convert the contents of that string into an array.

Or say you want to convert an entire string of text, with all kinds of punctuation and whitespace, into an array, but only keep the actual words. This wouldn’t be practical with explode(), but with split() it’s easy:

@fruit = split(/\W+/,$fruit);

The \W metacharacter matches any non-word character (ie. anything other than a-z, A-Z, 0-9 or underscore), and again the + quantifier means it will match one or more times.

And of course, regular expressions are useful for many more tasks that, while possible with most languages’ existing string functions, can get very nasty quite quickly. Say you want to check that a UK postcode is valid (note that, for the sake of simplicity, I’m going to ignore BFPO and GIR postcodes). These use a format of one or two letters, followed by one digit, then optionally an additional digit or letter, then a space, then a digit, then two letters. This would be a nightmare to check using most languages’ native string functions, but with a regex in Perl it’s relatively simple:

my $postcode = "NR1 1NP";
if($postcode =~ m/^[a-zA-Z]{1,2}\d{1}(|[a-zA-Z0-9]{1})(|\s+)\d{1}\w{2}$/)
{
    print "It matched!\n";
}

And if you wanted to return the first part of the postcode if it matched as well, that’s simple too:

my $postcode = "NR1 1NP";
if($postcode =~ s/^([a-zA-Z]{1,2}\d{1}(|[a-zA-Z0-9]{1}))(|\s+)\d{1}\w{2}$/$1/)
{
    print "It matched! $postcode\n";
}

Now, you may say “But that’s in Perl! I’m using PHP!”. Well, regular expressions are an extremely powerful part of PHP too, they’re just not as central to the language as they are in Perl. PHP actually has two distinct flavours of regular expressions - POSIX-extended regular expressions, and Perl-compatible regular expressions (PCRE). However, the POSIX-extended functions have been deprecated since PHP 5.3, so it’s not really worth taking the time to learn them when PCRE will do exactly the same thing and is going to be around for the foreseeable future. Furthermore, most other programming languages also support Perl-compatible regular expressions, so once you’ve learned them in one language, that knowledge transfers very easily to most other languages that support regular expressions.

In the first example given above, we can replace explode() with preg_split(), and the syntax is virtually identical to split() in Perl, the only differences being the name of the function and that the pattern, delimiters included, is passed as a quoted string:

<?php
$fruit = "apple,banana,cherry";
echo $fruit."\n";
$fruitArray = preg_split("/,/",$fruit);
foreach($fruitArray as $fruitArrayItem)
{
    echo $fruitArrayItem."\n";
}
?>

The output is the same as before:

apple,banana,cherry
apple
banana
cherry
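
And just like the Perl examples earlier, preg_split() will happily take the \s+ or \W+ patterns when the input is messier than a simple comma-separated list. Here’s a quick sketch of the \W+ case - the $text value is just made up for illustration:

<?php
// A made-up example string with mixed punctuation and whitespace
$text = "apple, banana   cherry;orange!";

// Split on one or more non-word characters, as in the Perl \W+ example.
// PREG_SPLIT_NO_EMPTY discards the empty string left by the trailing "!".
$words = preg_split("/\W+/", $text, -1, PREG_SPLIT_NO_EMPTY);

foreach($words as $word)
{
    echo $word."\n";
}
?>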

Along similar lines, if we want to check if a string matches a pattern, we can use preg_match(), and if we want to search and replace, we can use preg_replace(). PHP’s regular expression support is not appreciably poorer than Perl’s, even if it’s less central to the language as a whole.
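
To illustrate, here’s the earlier postcode example again, this time in PHP - just a sketch using the same simplified pattern as before:

<?php
// The same simplified UK postcode pattern as the Perl example above
$postcode = "NR1 1NP";

if(preg_match('/^[a-zA-Z]{1,2}\d{1}(|[a-zA-Z0-9]{1})(|\s+)\d{1}\w{2}$/', $postcode))
{
    echo "It matched!\n";
}

// preg_replace() covers the search-and-replace version, returning just the first part
$firstPart = preg_replace('/^([a-zA-Z]{1,2}\d{1}(|[a-zA-Z0-9]{1}))(|\s+)\d{1}\w{2}$/', '$1', $postcode);
echo $firstPart."\n"; // prints "NR1"
?>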

But regular expressions are slower than PHP’s string functions!

Yes, that’s true. So it’s a mistake to use regular expressions for something that can be handled quickly and easily using string functions. For instance, if in the following string you wanted to replace the word “cow” with “sheep”:

The cow jumped over the moon

You could use something like this:

<?php
$text = "The cow jumped over the moon";
$text = preg_replace("/cow/","sheep",$text);
?>

However, because here you are only looking to match literal characters, you don’t need to use a regular expression. Just use the following:

<?php
$text = "The cow jumped over the moon";
$text = str_replace("cow","sheep",$text);
?>

But if you have to do more complex pattern matching, you have to start using strpos() to find the location of specific characters and then pull out substrings between those positions, and it gets very messy, very quickly indeed. In those cases, while I haven’t done any benchmarking, it stands to reason that you’ll soon reach a point where a regex would be faster.
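
As a rough illustration - a made-up example, and not benchmarked - compare pulling an order reference out of a line of free text using string functions versus a single regex:

<?php
// A made-up example: pull an order reference like "ORD-12345" out of free text
$line = "Customer query regarding order ORD-12345, raised by phone";

// The string-function approach: find the prefix, then walk forward collecting digits
$start = strpos($line, "ORD-");
if($start !== false)
{
    $reference = "ORD-";
    $position = $start + strlen("ORD-");
    while($position < strlen($line) && ctype_digit($line[$position]))
    {
        $reference .= $line[$position];
        $position++;
    }
    echo $reference."\n";
}

// The regex approach: one pattern does the same job
if(preg_match('/ORD-\d+/', $line, $matches))
{
    echo $matches[0]."\n";
}
?>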

However, for a number of common tasks, such as validating email addresses and URLs, there’s another way - you don’t need to resort to regular expressions or faff about with loads of string functions. The filter_var() function can validate or sanitise email addresses and URLs, among other things, so it’s worth using instead of writing a regex. If you’re using a framework such as CodeIgniter, you may also have access to its native functions for validating this kind of thing, so you should use those instead.
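
For instance, here’s a minimal sketch of validating an email address and a URL with filter_var() - the values are just made-up examples:

<?php
// filter_var() returns the filtered value on success, or false on failure
$email = "someone@example.com";
if(filter_var($email, FILTER_VALIDATE_EMAIL) !== false)
{
    echo "Looks like a valid email address\n";
}

// The same approach works for URLs
$url = "http://www.example.com/";
if(filter_var($url, FILTER_VALIDATE_URL) !== false)
{
    echo "Looks like a valid URL\n";
}
?>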

But regular expressions are ugly and make for less readable code!

Not really. They seem intimidating to the newcomer, and very few people can just glance at a regex and instantly know what it does. But with regexes, you can often do complex things in far fewer lines of code than would be needed to accomplish the same thing using just PHP’s string functions. If you can do something in a line or two using string functions, it’s probably best to do that. But after that, things go downhill very quickly.

Once you learn them, regular expressions really are not that hard, and you’ll probably find enough things to use them for that you’ll get plenty of practice at them. They’re certainly more readable to anyone with even a modicum of experience using them than line after line of flow-control statements.

But you shouldn’t be using regular expressions for parsing HTML or XML!

Quite true. Regular expressions are the wrong tool for that job; you should use an existing parsing library instead.
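
In PHP, for instance, the built-in DOMDocument class will do the job. A minimal sketch - the HTML snippet is made up:

<?php
// A sketch using PHP's built-in DOM extension rather than a regex
$html = "<ul><li>apple</li><li>banana</li><li>cherry</li></ul>";

$document = new DOMDocument();
$document->loadHTML($html);

// Pull out the text of every <li> element
foreach($document->getElementsByTagName("li") as $item)
{
    echo $item->textContent."\n";
}
?>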

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

Ah, yes, surely one of the most misused quotes on the web! Again, regular expressions are not the right tool for every job, and there are plenty of tasks they get used for that, quite frankly, they shouldn’t be. Most of us who know regular expressions have been known to use them for things we probably shouldn’t (I only just stumbled across filter_var(), so I’ve done my share of validating email addresses with regexes, and I’m as guilty as anyone else of overusing them). But there’s still plenty you should use them for, when what you need to do can’t be accomplished quickly and easily using string functions.

Regular expressions are not inherently evil. They’re a tool like any other. What is bad is using them for things where a simple alternative exists. However, they are still extremely useful, and there are plenty of valid use cases for them.

13th January 2012 7:25 pm

Github

To date, Subversion is the version control system I have the most experience with. I use it at work, and I was already somewhat familiar with it beforehand. However, with all the buzz around Git over the last few years, it’s always been tempting to explore it as an alternative.

I’ve had a Github account for over a year, but had not yet added anything to it. However, today that changed. I’ve had a rather haphazard approach to my .vimrc and other Vim configuration files for a while, with the result that they tend to be less than consistent across different machines. I’ve seen that a fair number of people put their Vim configuration files under version control, and that seemed like an effective solution, so I’ve gotten my .vimrc and .vim into a respectable state and added them to a new repository. Now I should have no excuse for letting them get out of sync.

I have to say, Github is a truly wonderful service. The tutorials for getting started with Git are really good and make it easy to get going. It’s probably one of the main reasons why Git is becoming more and more popular - there isn’t really anything comparable for Subversion.

24th October 2011 10:18 pm

Linux in the Workplace

At the start of September I left my customer services role and started a new position as a web developer. I won’t give the name of either my old or new employer, but I will say that the new role is with a much smaller company, and the part I work for now is an e-commerce store that enjoys a significant degree of independence from the parent company. There are only two developers, including myself; we are solely responsible for the company’s IT infrastructure, and we don’t have the hassle of dealing with legacy applications or systems. We therefore have considerable freedom in terms of what we choose to use to get our work done.

When I first started, I used Windows XP Professional since that was what my work laptop came with, but it soon became obvious that there wasn’t actually anything I specifically needed to be using Windows for. I mostly work on the company’s intranet, which doesn’t really need to be tested in Internet Explorer as we use Firefox internally. For email and calendar, we use Google Apps, which works fine with virtually any email client that supports IMAP, so I was using Thunderbird with the Lightning plugin.

When coding I used Netbeans with the jVi plugin for most of my work, with occasional usage of Vim for writing shorter scripts. I used AppServ to provide local versions of Apache, MySQL and PHP, and I used PHPMyAdmin to interact with the database. For version control, I used Subversion. From time to time I need to remote into another machine using VNC, SSH or RDP, for which I used mRemote, but I was confident I could find an equivalent application.

Also, we use Ubuntu on most of our servers, so it made a lot of sense from a compatibility point of view to also use it on my own desktop. From time to time, I also found myself writing bash or Perl scripts for systems administration purposes, and since it wasn’t really very practical to do that in Windows when it was going to be running in Ubuntu, I’d used an Ubuntu Server install in Virtualbox to write it, but it was obvious that running Ubuntu as my desktop OS would make more sense.

As Ubuntu 11.10 was due a little over a month after I first started, I decided to hold off making the switch until then so I could start with the most recent version and not have the hassle of upgrading an existing install. I had already downloaded the 64-bit version of Ubuntu 11.10 for my home machines and burned it to a CD, so I brought the CD into work and set up a dual boot so I could revert back to XP if anything went wrong, and also so I could easily copy across any files I needed from the Windows partition.

It took a fair while to get everything I wanted installed, but a lot less time than it would have taken if I’d set up Windows XP from scratch. The hardware all worked fine out of the box, and most of the software I needed was in the repositories. The only thing I really needed that wasn’t there was Netbeans (which has apparently now been removed from the repositories), but the version in the Ubuntu repositories has never been very up-to-date anyway, so instead I installed the version available from the Netbeans website, and that has worked fine for me. While there wasn’t a version of mRemote available, I did discover Remmina, which has proven to be an excellent client for SSH, RDP and VNC, to the point that I’ve now stopped using the terminal to connect via SSH in favour of Remmina. Thunderbird does just as good a job with my email and calendar as it does on Windows, and I also have Mutt available. Naturally, it couldn’t be simpler to install a full LAMP stack and PHPMyAdmin either. In fact, the only application I use much that I couldn’t get a decent version of was MySQL Workbench, and that was only because Oracle haven’t yet released a version for Ubuntu 11.10 (I tried the version for 11.04, but it doesn’t seem to work), but I can live without that.

What’s interesting is that despite all the scaremongering I’ve heard over the years about how Linux isn’t ready for the workplace, I’ve as yet had no problems whatsoever. For everything I used in Windows, it was either available on Ubuntu, or there was a viable equivalent, or I could get by fine without it. Granted, the nature of my work means I have little need for the small amount of functionality that Microsoft Office has and LibreOffice doesn’t, and I don’t need to use the kind of ghastly legacy apps written in Visual Basic that most large enterprises commonly use, but I haven’t noticed any significant barriers to my productivity.

In fact, if anything I’m considerably more productive. I know people like to rag on Unity, and I wasn’t happy with it in the netbook edition of Ubuntu 10.10 myself, but in 11.10 it’s really starting to show its promise, and I haven’t had any problems with it. The fact that I know Ubuntu a lot more thoroughly than I do Windows, purely from my own experience at home, means that I can get things done a lot quicker. The package management system also means I’m largely free from the annoyance of opening an application in the morning only to be confronted with an update dialogue, quite apart from the fact that very few updates require a restart. I’d go so far as to say that I’ve been more productive using Ubuntu at work than I would have been with either Windows 7 or OS X (and over the last few years I’ve used Windows Vista, Windows 7 and OS X fairly extensively).

I really don’t want this to turn into Yet Another Year of the Linux Desktop blog post, because that’s rather a tired old cliche, but I have absolutely no problems whatsoever getting my work done on Ubuntu. I’ll concede that as a developer I have significant freedom that isn’t often afforded to other people, and running some flavour of Unix makes a lot of sense if you’re a developer working with one of the open-source server-side languages such as PHP or Python (if I were a .NET developer, it would make rather less sense). I’m also lucky to be in a position where I don’t have to worry about legacy apps or IE compatibility too much. Nonetheless, it’s still remarkable how smoothly my migration across to Ubuntu on my work desktop has gone, and the extent to which I find it’s improved my workflow.

29th May 2011 2:53 pm

Hacked!

Had a rather unfortunate incident last month - someone hacked into my Pogoplug mail server, and managed to get their mitts on my .fetchmailrc, which had all the login details for several email accounts. They promptly began sending spam out using my Gmail account.

Naturally this meant I spent ages running round like a headless chicken trying to lock them out - when I first noticed that they’d been sending emails directly from my mail server, I logged into it via SSH and shut it down, then changed the passwords on all my email accounts.

Thinking logically, there were four services that I had forwarded ports to the server for - SSH, Apache, Postfix and Dovecot. Now, I was running SSH on a non-standard port, had disabled root access, and didn’t allow password authentication (SSH keys only). Also, I had enabled DenyHosts, so I’m fairly confident SSH was not the point of entry.

So that leaves Apache, Postfix or Dovecot. I had noticed a lot of characters prefixed with backslashes in the error logs, and wondered if someone was trying some kind of shellcode injection, so to be safe I had added new iptables rules to blacklist the IP addresses responsible. I had done what I could to secure Apache, but I can’t rule it out as the application that was compromised. I went through the server logs without finding anything - I’m guessing whoever was responsible deleted the relevant entries. I couldn’t be sure that the server could still be trusted, so I did a fresh install, and have disabled port forwarding on my router.

This has certainly made me much more cautious and suspicious about security, which I guess can’t be a bad thing. Even beforehand, I found it pretty scary to see the sheer number of script kiddies who will try to hack into any server on the Internet.

30th March 2011 8:34 pm

New Phone

On Friday of last week I unexpectedly got a text from Vodafone saying I was able to upgrade my phone early. I was pretty pleased about this as having been something of an Android early adopter, I was still using an early Android phone, namely my HTC Magic. While a fine phone when it was released, it was only the second Android phone to become available in the UK and was therefore a bit dated compared to newer devices. It has been upgraded to Froyo (albeit a cut-down custom build) but that did slow the phone down somewhat.

So as soon as I had the opportunity I had a good look around for a new one to replace it. Right from the start I had my eye on the HTC Desire Z. Much as I love touchscreen phones, it’s very often extremely handy to have a physical keyboard, and as I’ve found myself using ConnectBot to connect to my home server via SSH a lot, the keyboard-toting Desire Z immediately had an advantage over the touchscreen-only models. Ideally I didn’t want to change my plan, so I checked out the deals for HTC phones on my existing plan, and the Desire Z happened to be the only one available on it, so it was a no-brainer.

I got the phone on Monday, and it is amazing. The keyboard is easy to use and works well, the phone is lightning fast, and the UI is spot-on - it has everything I love about Android on the Magic (like the great notification system) and more. In particular I love the RSS reader - it syncs with Google Reader, so if I have to wait for a train, I can at least read some feeds while I’m waiting.

One thing I’m hoping to get more use out of is SL4A. I had this on my Magic, but coding on a touchscreen phone is not easy! I’m hoping that with the Desire Z’s keyboard, this will be a lot more useful.
