I posted my article Build Your Own Regex Engine on Reddit the other day, and one of the commenters claimed that the implementation should be trivial to break. Since I had already tested my program against a customized suite of tests, the remark got me thinking about how I could further increase my confidence in the correctness of my program. One extremely low cost however effective strategy for identifying faults in software is known as fuzzing.
The first step is to specify the grammar that our regex engine supports.
You might notice we skipped the
$ characters. More on these in a little bit.
We want to write a function
generateRegex that will select
n random characters from the
regexGrammar and concatenate them together into a string. This string will be used to create a test regex.
Here are three possible returns values of
My regex engine only deals with a very small subset of available regex syntax, and furthermore, it does not contain any error handling. What happens if
generateRegex returns the pattern
^*? My regex engine was never designed to handle these inputs, though they are possible outputs of
generateRegex. We need to make a choice about how to handle these expressions. Since the primary goal of my regex engine is accessibility and simplicity of implementation, I’m not about to begin supporting these edge cases. That means my fuzzer should not generate them either.
One solution to determine if a given regex string is valid is to specify my regex engine’s allowable grammar in BNF. BNF is a formal notation for specifying the syntax of a language. Given this BNF notation, I could ask another program if the randomly generated regex string can be created using my BNF specification. This sounds like a little more work than I want, however, since the invalid cases can simply be manually enumerated and filtered.
One more modification to
generateRegex is necessary to support
$, and then we are basically done.
All that is required now is to repeatedly invoke
generateRegex a fixed number of times and then compare the output of the native JS implementation with the output of my implementation.
I ran my fuzzer for a couple million randomly generated cases and ended up learning two things about my regex engine.
My implementation fails extraordinarily with longer texts. I knew recursion would be a problem for any practical regex implementation (at least without tail calls) and would cause stack overflows, but I didn’t expect it to fail with texts that were only a couple thousand words. I think this is because I make liberal use of backtracking algorithms in
matchStar. Since I was forced to test with a relatively short input text, it makes sense to use multiple text inputs to increase the probability of discovering an error.
My implementation treats the
.character differently than the native implementation. In the RegExp implementation,
.will not match various line terminators (
\u2029). My implementation does.
The biggest takeaway is that fuzzing is an simple and inexpensive way to enumerate enormous sets of inputs and identify bugs in your software. This fuzzer took less than an hour to write.
But remember, this fuzzer’s blessing of a couple million input combinations does not verify the correctness of my program. Not even close. A fuzzer is a tool to identify potential errors. Unless you enumerate all possible inputs (completely impossible in this case where they are infinite), you are not guaranteed your program is error free.