Tuesday, October 20, 2009

Regular expression 101 in php

Regular expressions are an old but powerful tool for pattern matching: they date back to the 1960s and are still in wide use today. They can be used for specifying the logical structure of a string and come in handy for validating generic input or processing string data, as valid substitutes for the classic string manipulation functions (which work at a lower level, character by character, since in php they are borrowed from C). Here's a presentation of basic regular expression-fu with examples written using php core libraries, and a running example as a php script.

In php, there have been two engines for regular expressions: POSIX Regex (ereg(), deprecated since php 5.3) and Perl Compatible Regular Expressions (PCRE). We are going to explore a few of the capabilities of the PCRE engine, whose Perl-style syntax is largely shared by many other languages such as Javascript and Python.

Defining a pattern is the first requirement to use the regular expression engine. A pattern is a string that respects a formal language and thus represents a (possibly infinite) set of ordinary strings. There is no physical difference from a normal string in php, as the type of the variable is still string, but when passed to a preg_*() function it takes on a particular meaning. Other languages use objects for pattern storage, to provide type safety.
The pattern has to reflect the structure of the string you want to check. Typically the string is part of user or application input and you cannot anticipate its content. A regular expression is a means of specifying its structure and rejecting the input in case it does not conform to the rules. Every time a part of a string conforms to a pattern, it is said that the pattern matches it (or, the other way around, that the string matches the pattern).

The simplest pattern you can write is an alphanumeric string. In PCRE, the pattern must be enclosed in a pair of delimiters, most commonly slashes "/".
- '/foo/' matches 'foo', but also 'foooo' and 'my fooooo'.
Obviously a literal pattern has little utility, so it's better to specify a character range with a quantifier:
- '/([a-z]{1,})@gmail\.com/' matches any Google mail address composed of lowercase letters, such as 'address@gmail.com'. The range is specified in square brackets [], where you can put single characters or alphanumeric intervals, written with a '-' between the two ends. The possible repetitions of the subpattern go from 1 to infinity in this example. The specification can be, for instance, zero to four times: {0,4}. The zero quantification is useful for patterns that may be absent. There are shortcut quantifiers such as * and +, but I think that for this demonstration a pattern that contains braces is clearer.
- '/gmail.com/' matches 'address@gmail.com' but also 'gmailicom'. Beware that many non-alphanumeric characters have a special meaning when used outside of a range, and should be escaped with a backslash if you want to match their literal values. The dot character normally is a wildcard for any character other than a newline.
By using preg_match($pattern, $string), you can check that any input conforms to some simple rules. The function returns 1 if the pattern is found in the string and 0 if it is not found at all (or false if an error occurs). It stops at the first match; preg_match_all() is the function that counts every occurrence.
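The patterns listed above can be tried out directly. This is a minimal sketch that assumes nothing beyond the core preg_match() function; the variable names and the uppercase sample address are made up for the illustration.

```php
<?php
// A literal pattern is found anywhere inside the subject string.
$literal = preg_match('/foo/', 'my fooooo');                            // 1

// A range with a quantifier: one or more lowercase letters before '@gmail.com'.
$range = preg_match('/([a-z]{1,})@gmail\.com/', 'address@gmail.com');   // 1
// Assumed counter-example: no lowercase letter sits right before the '@'.
$upper = preg_match('/([a-z]{1,})@gmail\.com/', 'ADDRESS@gmail.com');   // 0

// An unescaped dot is a wildcard, so it also accepts the 'i' in 'gmailicom'.
$dot = preg_match('/gmail.com/', 'gmailicom');                          // 1
// An escaped dot only accepts a literal '.'.
$escaped = preg_match('/gmail\.com/', 'gmailicom');                     // 0
```

Comparing the result with === 1 rather than relying on truthiness keeps a failed match (0) distinct from an engine error (false).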

Another great feature of regular expression matching is the extraction of data via captured subpatterns. Enclosing parts of the pattern in parentheses defines a subpattern whose actual content can be returned by preg_match().
$matches = array();
preg_match('/([a-z]{1,})@gmail\.com/', 'address@gmail.com', $matches);
// $matches[1] now contains 'address'.
$matches, the optional third argument of preg_match(), is an array passed by reference where the subpatterns will be saved. The first element, $matches[0], is always the part of the string that matches the entire regular expression, while the subsequent ones are the subpatterns, in the order in which they appear in the pattern.
Without regular expressions, you would need to explode the string by different characters and navigate between the pieces to find out what you want, or do complicated substring slicing and calculations. Regexes are the standard tool to do it concisely, introducing fewer bugs and edge cases along the way.
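To make the comparison concrete, here is a rough sketch that extracts the part before '@gmail.com' both ways. The helper names localPartWithRegex() and localPartWithExplode() are hypothetical, invented for this illustration; the regex version also uses the ^ and $ anchors so that the whole string has to conform, which makes the two helpers roughly equivalent.

```php
<?php
// Hypothetical helper: validate and extract with a single pattern.
function localPartWithRegex($address)
{
    $matches = array();
    if (preg_match('/^([a-z]{1,})@gmail\.com$/', $address, $matches) === 1) {
        return $matches[1]; // content of the first parenthesized subpattern
    }
    return null;
}

// Hypothetical helper: roughly the same check with plain string functions.
function localPartWithExplode($address)
{
    $pieces = explode('@', $address);
    if (count($pieces) !== 2 || $pieces[1] !== 'gmail.com') {
        return null;
    }
    $local = $pieces[0];
    // strspn() counts the leading characters that belong to the given set,
    // so the comparison holds only if every character is a lowercase letter.
    if ($local === '' || strspn($local, 'abcdefghijklmnopqrstuvwxyz') !== strlen($local)) {
        return null;
    }
    return $local;
}
```

Both helpers return 'address' for 'address@gmail.com' and null for input that does not conform, but the second one already needs three separate checks for a very simple format.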

Here is a script containing these examples, and many others, in the form of a phpunit test case. Seeing regular expressions at work and experimenting yourself is the best way to learn the PCRE pattern syntax. Remember to test your patterns thoroughly and not to duplicate them across your application.
Regular expressions are a vast field and there are many topics to learn, like vertical bars and circumflex/dollar usage (|, ^ and $, respectively). The PCRE engine supports a lot of special characters and ranges.
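As a small taste of those topics, here is a hedged sketch of the anchors and the vertical bar. The 'example' domain is an assumption added only to have a second alternative to choose from.

```php
<?php
// '^' ties the match to the start of the string, '$' to its end.
$anchored   = preg_match('/^foo$/', 'foo');        // 1: the whole string is 'foo'
$unanchored = preg_match('/^foo$/', 'my fooooo');  // 0: extra characters around 'foo'

// '|' separates alternatives; the parentheses limit its scope
// and capture whichever alternative was actually found.
$matches = array();
$found = preg_match('/^([a-z]{1,})@(gmail|example)\.com$/', 'address@example.com', $matches);
// $matches[1] is 'address', $matches[2] is 'example'
```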
These example patterns can be improved, too. I would be happy if you just started to consider the basic functionalities like I did, as they can save you from the nightmare of string manipulation functions. Although long patterns quickly become unreadable, they express intent better than two screens of substr() calls. You have a complete core php library at your disposal.
