In this post we will write a simple regex library. The table below presents regex operators that we are going to support:
|A single character e.g.
A single character matches itself.
Special characters need to be escaped e.g.
|Character groups e.g.
||Character group matches any character
that is part of the group.
Negated character group (
Character ranges like
||Just as we can concatenate strings,
we may also concatenate regexes.
The resulting expression will match an input, only when the input can be split into two parts, so that
the first part of the input matches the first concatenated regex and
the second part of the input matches the second concatenated regex.
For example input
||Input matches the alternative when it matches
any branch of the alternative.
For example inputs
Alternative has lower priority than concatenation so
Preceding regex must match input specified number of times.
Supported operators are:
On the other hand
In our limited implementation we do not support spaces (or any other whitespace characters) inside
Repetition has higher priority than concatenation and alternative so
Repetition operators are greedy, this means they will try to match as much input as possible.
Parentheses are used to alter the precedence of the operators.
For example compare
The library itself is quite small, it consists of two parts: a parser and a matcher. The parser is a very simple recursive descent parser and as most of manually written parsers it probably contains some undiscovered bugs. The parser itself is not the main focus of this article, so we will tread it as a black box that consumes regular expressions in text form and produces equivalent ASTs or Abstract Syntax Trees.
ASTs are tree like data structures,
that make order of operator evaluation and the internal structure of a regular
Let’s see this on an example. The AST representation of
^a(foo|bar|egg)*b$ regex is:
$ are represented by their own
To represent both character groups and single characters, we use
GROUP nodes are very simple, they contain a list
of characters that they match. A
GROUP node for
0123456789 in the list.
For negated groups like
[^0-9] we use
the representation is the same as for
GROUP (we keep
in the list).
Next is the tricky one, we use
NEGATED_GROUP without any characters
to represent wildcard (
.). This makes sense because an empty group
does not match anything, so its negation will match all characters.
To represent quantification operators like
REPEAT nodes contain two additional attributes:
minimum and maximum number of allowed repetitions.
Long.MAX_VALUE to signify that maximum number of repetitions is unbound.
Finally we use
CONCAT node to represent concatenation of two or more
In the code all AST nodes are represented by a single class
type field describes what kind of AST node this instance
NEGATED_GROUP nodes keep the set of matched/not-matched
ALTERNATIVE nodes keep their children in
(the order of children is important, hence a
REPEAT node keeps its only child as a single element list in
The matcher is represented in the code by
The interface of the matcher is very simple:
Our matcher will only find the first substring of the input that matches the regex. Yet it would not be too difficult to extend the algorithm to find all matches (start matching again after the end of the previous match).
To implement our matcher we will use an algorithm design technique
called backtracking. The general idea of backtracking is
very simple: we enumerate all the possible solution candidates
in a smart way and then we return the first candidate that is a valid solution.
The part “in a smart way” is very important, usually
enumerating all solution candidates will result in a very
slow algorithm (think
O(2^n) or even
The key here is to quickly and early reject some subsets of
the solutions candidates.
Let’s see this on a very simple example,
say we want to solve a puzzle that is about placing various shapes in 3x3 grid. Some places in the grid are already taken. There is also a rule that describes a valid solution: in every row and column we cannot have two shapes of the same kind.
Simple enumeration of all possible assignments of the shapes to the free places
3^5 solution candidates, most of them wrong.
A smarter candidate generation strategy
would be to check immediately after we fill a free place,
if the solution conditions still holds for this place row and column.
This would save us a lot of work, because we could discard a lot of
solution candidates much more early in the process.
For example we could discard all the solution candidates that have
Cloud shape at 1B position in the first step of the algorithm.
The name of the technique itself comes from the specific way in which the algorithm generates all the solution candidates. The naive backtracking algorithm that solves our simple puzzle looks like this:
The key part of the algorithm is the
where we put a shape on the free place and then remove it
when don’t find a solution. This removal
or taking a move back is what gave the algorithm its name.
Alternatively we may say that when
board is exactly in the same state as it was
before we called
Matching regular expression is somehow similar to solving our previous puzzle.
At each step we have several possibilities, like should we match
fo from the alternative
(foo|fo). How much characters should
f+ expression match? Should
bar or nothing?
On the other hand the most complexity in matching regular expressions
comes from the fact that we are matching a recurrent structure (tree).
It’s like matching our puzzle, but where each empty place can contain
another smaller version of the same puzzle that also must be solved.
The main entry point to our regex matching algorithm is
method. It takes three parameters,
input which is just
String plus a pointer called
pos that tracks next, not yet
Input also provides helpful methods that can
be used to save and restore
pos value (we will need that
for backtracking). Next parameter,
ast, is an AST subtree that the algorithm
should match. The last parameter
cont is the most interesting one.
In my previous blog post I wrote about
continuations, please read that post before
cont is lambda expression (but we may thread it also as
a kind of continuation), that when called will
try to match remaining part of the regex AST (e.g. it will match parents
and the remaining siblings nodes of
The contract of
match method is as follows.
If the method is not able to match
ast subtree it will
cont will not be called and the
will not be modified (or it may be restored to the original state).
On the other hand if
ast subtree could be matched then
cont will be called with the modified
match will return whatever
Before returning value to the client
input will be restored
to the original state (this is not strictly necessary if we
are looking for the first match but somehow makes algorithm more elegant).
At the top level we will call
match somehow like this:
This call means that we are looking for the matches
starting at index
startIndex and if we find one
we save the
end of the match in the
endIndex variable and return
to signify that matching process should stop.
This is the only place in the algorithm where we use
return true. Matched substring could be easily
The body of
match method is a giant switch statement that
delegates matching of different AST types to different methods (not counting simple
Let’s take a look how matching a
GROUP is performed.
If the group matches next input character, we save the current
input pointer in
m variable, then we advance the input pointer
and finally we call
cont to match the
rest of the regex. After
cont returns we restore the input position
finally block. This is a bit hacky but it works.
CONCAT node is also simple. We use recursion to match
subsequent children expressions:
Notice that because we are not consuming any input here we have nothing to restore.
ALTERNATIVE is easy to match:
currExpr points to the current alternative branch that we are matching.
Instead of using recursion we may implement
using a simple
for loop, which I left as an exercise for the reader.
REPEAT node causes the most troubles, because all
quantification operators are greedy by default. This means that
a+ will try to match as much characters as possible.
To implement this behavior we first attempt to match
REPEATs child subtree
as many times as possible, then we move “backwards” calling
each time to check if we have a match.
The diagram below illustrates this process:
The code of
And generally that’s it. Our simple regex engine.
There are some details left (like handling
startIndex) but the idea behind the backtracking
matcher should now be familiar to us.
You can find source code for the engine (including tests) on GitHub: https://github.com/marcin-chwedczuk/reng. But before you jump to see the code I really recommend you to write a similar engine (in your favorite language) yourself. This will give you a much more deeper understanding of the algorithm.
Last but not least, our algorithm has a decent performance,
given that we use reasonable regexes and inputs.
A regex like
(a+)*c matched on the input
will have a very bad performance. This is a common problem
not only in our matcher but in most regex libraries
that use backtracking algorithms. You can
read more about this problem on Wikipedia.