In this article I would like to describe the Shift-And algorithm for pattern matching, based on the book Flexible Pattern Matching in Strings. Apart from the actual description, you can scroll down for a demonstration of how the algorithm works.
Shift-And is categorized as a prefix-based algorithm, similar to KMP, but what Shift-And does is to keep a set of all prefixes
p that matches a suffix of the text read so far. Keep in mind that is the suffix of the text read so far, not a suffix of the whole text.
The set of prefixes is updated using bit-parallelism, that is, the set of prefixes is kept in a bit mask called
D is defined as: dm … d1, where
m is the canonical way of referring to the pattern length. For example, if we have a six character pattern, and so far our first character matches against the text, then
D would be
With every character read from the text,
D will be updated, if the bit dm is active, then we have a match. Is it possible to update this set in constant time? Yes, using bit-parallel operations.
This algorithm will first build a table
B, holding a bit mask bm … b1 for every character in the pattern. So, if we are searching for the word
announce, the entry for the character
n will look like this:
00100110, that is, the word
announce has an
So the bit mask table for the word
announce will end up looking like this:
Of course we will have the actual numbers, not those bit masks, with the padding and so on. We also added an
* entry in the table. That’s because whenever a read character from the text doesn’t match, then the bit mask used will be
Once we have the table built, we start searching for the pattern in the text. First we set
0, and then we start reading from the text, character by character. Let’s say we read an
a, so we search for the
a bit mask in our table, it will return
00000001. To calculate the new
D we first shift it
1 to the left, and we
1 to it :
(D << 1) | 1. That value gets
AND'ed to the mask we retrieved from the table, so this would be the complete line:
D <- ((D << 1) | 1) & B[current-char].
We keep iterating over the text until we find a match. How do we find a match? We need to check if the bit dm is active. To do that we have to run the following operation:
D & (1 << m-1). We create the following bit mask
10000000 and we
AND it to
D. If this is not
0, then we have a match.
And now to the interesting part, the visual simulation:
Shift-And visual simulation
Let me briefly explain what’s going on here. First we have a form, where you can enter the pattern to search, and the text where to search it for. That’s easy.
start, the visualization will display the text and bellow it the pattern you are searching for. Beneath them, there will be the
b table and next to it, the results for each step will be reported.
For example, we will see the initial value of D, then it will be
(D << 1) | 1. Then we read a character from the text, and we get the value from the table,
AND'ing it to the value of
D and getting the result below, and so on.
I hope the colors are pretty self explanatory (I’m colorblind so don’t trust me on this one).
So, click start and enjoy the show.
I hope this post has made justice to what I think is one of the most elegant algorithms that I know of. If there was a book called “Algorithms from the Book” similar to this, I’m sure this algorithm would make it.
If you are interested in learning more about Pattern Matching in Strings, then I would recommend you get a copy of the book Flexible Pattern Matching in Strings since it’s really worth every penny.
I would like to mention also Mr. @sicher who basically single handedly fixed the mess of HTML and CSS that I had created. So if you enjoyed the visual styles, then send him some props on twitter.