Skip to content

Regex Induces Lunacy

August 21, 2011

This week my coworker and I discussed whether regular expressions belong in production code. His opinion is there almost always exists a more straightforward way to match text that doesn’t require a language consisting of arcane syntax. His preference is to use a perhaps less powerful, but easier approach, such as a string library.

While I see his point because sometimes regular expressions are abused, when applied appropriately and sparingly it is a fine practice, because you can’t really beat the flexibility and power it offers.

By the way, this entire article is from the perspective of server systems that manage and serve data consisting of the activities of a billion users in real-time. And in this context, while some text-processing is involved, we work primarily on the distributed system components, where regular expressions don’t occur much in typical code.

The Wacky Regex

The reason we had this conversation is that I encountered an expression that baffled me, and asked him what he thought it did:

^[-]?\d+(?=(,.+$)|(?!.+$))

When confronted with a regex I don’t understand, I break it into its sub-expressions and learn the functionality from the bottom up. For example:

.+$matches any character sequence to end of line, i.e. “stuff until EOL”
(?! )successful at a position where the inside expression does not match
(?!.+$)successful at a position where there is no “stuff until EOL”
(,.+$)matches a comma, followed by any character sequence to the end of line, i.e. “comma then stuff until EOL”
(?= )successful at a position where the inside expression matches
(?=(,.+$)|(?!.+$))successful at a position where there is either a “comma then stuff until EOL” or at a “position where there is no stuff until EOL”
^[-]?\d+a string of one or more digits occurring at the front of a line, with an optional negative sign
^[-]?\d+(?=(,.+$)|(?!.+$))a digit, possibly negative, followed by either a comma then stuff or nothing, until the EOL


This expression is a complicated way to match and extract an integer that occurs at the beginning of a line, optionally negative, and possibly followed by a comma-separated substring.

Because the (?= ) expression is an anchor, it’s not actually part of the match, so the resulting match is just the number. I already knew from the context of the code that it was extracting a number (and the \d+ was a tell), and the code was working, so I needn’t meddle with it. But because I didn’t understand the expression, my curiosity wouldn’t let me leave it alone until I did. And once understanding the intent, making it easier for the next person who finds it seemed like the thing to do.

I chose something more straightforward (see the comments for even better alternatives):

^(-?\d+)(,.+|)$

Here’s a breakdown:

^(-?\d+)capture a string of one or more digits occurring at the front of a line, with an optional negative sign
(,.+|)capture either a “comma then stuff” or nothing
^(-?\d+)(,.+|)$capture a digit, with option negative sign, followed by either a comma then stuff or nothing until the EOL


Without resorting to the same level of fanciness, this expression is compatible and easier to understand. (I haven’t found an example where the modified version differs from the original, but perhaps there’s some overlooked corner case… if you know of one, let me know). Using a capture, I neatly grab the number as well. Since the section after the number isn’t needed, a slightly more stylish expression may employ the non-capturing grouping: ^(-?\d+)(?:,.+|)$.

My regex-avoiding coworker might write a loop over the string to check and parse the numeral, looking something like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
bool TheHardWay(string line, out int result)
{
    // precondition !string.IsNullOrEmpty(line) check elided
    int startIndex = (line[0] == '-') ? 1 : 0;
 
    int pos;
    for(pos=startIndex; pos < line.Length; ++pos)
    {
        if(!char.IsDigit(line, pos))
        {
            break;
        }
    }
 
    // successfully matched if there is at least one digit, and we ended
    // at the EOL or on a comma
    bool match = (pos > startIndex) &&
                 ((pos == line.Length) ||
                  (line[pos] == ','));
 
    result = match ? int.Parse(line.Substring(0, pos)) : 0;
    return match;
}

However, in this case I’d prefer the regular expression:

bool TheRegexWay(string line, out int result)
{
    // precondition !string.IsNullOrEmpty(line) check elided
    Match match = Regex.Match(line, @"^(-?\d+)(,.+|)$");
    result = match.Success ? int.Parse(match.Groups[1].Value) : 0;
    return match.Success;
}

By the way, these two functions are not quite equivalent. The regex asserts that if there is a comma, then there is at least one character that follows it, while the C# loop does not check if there is content after the comma. The loop is doing something more like ^(-?\d+)(,|$). In order to match the regex, change the condition on line 19 to (line[pos] == ',' && line.Length > pos+1).

The original example shows why people may shy away from using regular expressions: a coworker and I were both perplexed by it. But this is because it unnecessarily used advanced features (both positive and negative lookaheads). I don’t know if the author is really proficient at regex, or was just smoking some good stuff at the time. While figuring out how it works was like solving a neat puzzle, this is not always enjoyable under deadline pressure.

I like the revised version over the manual string parsing function. It’s more straightforward and manageable. Of course, this is due to my comfort level with regex features. I like using the subset stored in my head without a reference. Working through examples like the first one helps me expand this knowledge, but I don’t use regex enough to frequently encounter the more esoteric features, and because of the nature of my projects, many coworkers feel the same. If my teams were made of text-processing jockeys, then that’s a different story.

In the same source code was a second confusing expression:

^[-\\0-9A-Za-z_.:$]+(?=(,.+$)|(?!.+$))

From the surrounding context I could tell that this looks for a file path. And after encountering the first expression, the second half is accounted for: looking for an optional, comma-delimited substring. So the new section is just ^[-\\0-9A-Za-z_.:$]+. This is not so unmanageable: match a string composed of one or more of the included characters (digits, lower or upper-case letters, or the symbols -, \, _, ., :, or $) that occur at the start of the line. This one is buggy because it ignores other legal symbols in file paths, including space and comma. In fact, a comma in a file path would break this expression from working correctly. In this case the regex is problematic, another cause for frustration.

Power vs. Performance

A nice thing about regex expressions is that you can write a powerful function with a few characters of code (forget counting lines), as demonstrated by comparing the C# version to the 16 character regex version of the two examples above. However with this power comes responsibility, and I’ve seen production code (where performance matters) abuse regex due to its convenience.

The most egregious example was a bad-words filter list using a giant ‘or’ expression:

(a|very|long|list|of|unmentionable|bad|words|that|includes|hundreds|of|entries)

CPU profiling the program showed that about 80% of its time was spent in that function. Incredulously, I showed it to my coworkers, who then became more interested in finding new, and increasingly disturbing, words to add to the list (thus deteriorating performance further). I replaced it with a trie search structure, and its CPU use dropped to a negligible amount, undetected by the sampling profiler.

Sometimes, you don’t want to avoid using a regex. In this case, it’s good to know that some languages provide a way to increase performance by compiling the expression. For example, I found another regex that consumed about 40% of the program’s CPU use, because it was used repeatedly. It was the kind that did something useful with a short, understandable expression, so writing a specialized function to replace it was undesirable as a first cut.

It was declared like so:

private static Regex myRegex = new Regex(regexString);

In .NET, you may add the “Compiled” flag to the constructor like so:

private static Regex myRegex = new Regex(regexString, RegexOptions.Compiled);

The CPU profile of the compiled regex dropped from 40% to 7%, much more attractive, and now good enough to leave alone and focus on other tasks. Over four times improvement came from setting the “turbo” flag. It turns out that without compiling, the regex is interpreted on each use, which is much more expensive. Compiling it is expensive too, but results in a faster interpretation. So if the expression is used often , the one-time cost of compiling may give significant savings over the reuse lifetime. Although this isn’t always the case, as described here you’ll need to evaluate it on a case-by-case basis.

Lastly, here’s a place where regex seemed inappropriate. The following example sucks in a 1 MB log file, to process it line-by-line, where each line is about 200 characters (so roughly 5000 lines). It did so the following way:

foreach (String line in Regex.Split(fileContents, "\r\n|\r|\n"))
{
    ...
}

My first instinct is wonder why use regex for splitting, when there is String.Split() with the same effect? This must be the work of a Perl programmer. If the pattern is straightforward text, a string library does the job nicely without firing up a regex interpreter. But even that seems absurd, given splitting produces 5000 string objects. Instead, use something that appropriately processes the input line-by-line. A StringReader preserved my sanity:

using (var sr = new StringReader(fileContents))
{
    string line;
    while ((line = sr.ReadLine()) != null)
    {
        ....
    }
}

Ok, so yeah, processing a log file is not a performance critical application. I sometimes forget I don’t work on embedded software these days. In this case, wasting an extra 5000 strings probably won’t matter in the overall scheme of things because 1 MB of memory is nothing on a modern processor (the other loop also creates the same number of strings, it just does so line-by-line instead of all at once). But doing it line-by-line allows the memory to be reclaimed faster, and these resource considerations may matter for larger file sizes, or when the loop can terminate early, etc.

Conclusion

In summary, when using regular expressions:

  • Choose the simplest expression that works, remembering that other people will need to understand it. Resist showing off your awesome regex skillz unless the job actually calls for it.
  • In code where performance matters, consider that other options such as custom-written functions may beat regex performance–there’s a trade-off between performance and generality of expression. Measure if possible, then decide.
  • Consider compiling often-used expressions when possible, if you can amortize the setup cost. Again, measure and decide.
  • Don’t use regex when it’s the lesser obvious choice.

In short, don’t be a regex lunatic.

And to reiterate one of the opening points, this is meant in the context of certain kinds of applications. This is not a blanket denouncement of regular expressions. If your application is text processing, you probably want the advanced features of regex. They were created with good reason, so use them if you need them.

One Comment leave one →
  1. sbanacho permalink*
    August 22, 2011 9:54 AM

    Last night, when I wish I was sleeping, instead an alternate way to express ^(-?\d+)(,.+|)$ popped into my head. In fact, two ways:

    1. ^(-?\d+)(,.+)?$
    2. ^(-?\d+)(,.+)*$

    It turns out either is equivalent in what they match, and in what they capture, even if there are multiple commas in the substring, because the .+ greedily consumes any subsequent commas. If (2) is slightly modified to ^(-?\d+)(,.+?)*$, then it will capture comma-delimited strings separately.

Leave a Reply

Your email address will not be published. Required fields are marked *