Is a single big regex more efficient than a bunch of smaller ones?

Question

I'm working on a function that uses regular expressions to find some product codes in a (very long) string given as an argument.

There are many possible forms of that code, for example:

UK[A-z]{10} or DE[A-z]{20} or PL[A-z]{7} or...

What solution would be better? Many (most probably around 20-50) small regular expressions or one huge monster-regex that matches them all? What is better when performance is concerned?

Casimir et Hippolyte · Accepted Answer

It depends what kind of big regex you write. If you end with a pathological pattern it's better to test smaller patterns. Example:

UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7}

this pattern is very inefficient because it starts with an alternation, this means that in the worst case (no match) each alternative needs to be tested for all positions in the string.
_{(* Note that a regex engine like PCRE is able to quickly find potentially matching positions when each branch of an alternation starts with literal characters.)}

But if you write your pattern like this:

(?=[UDP][KEL])(?:UK[A-Za-z]{10}|DE[A-Za-z]{20}|PL[A-Za-z]{7})

or the variation:

[UDP][KEL](?:(?<=UK)[A-Za-z]{10}|(?<=DE)[A-Za-z]{20}|(?<=PL)[A-Za-z]{7})

Most of the positions where the match isn't possible are quickly discarded before the alternation.

Also, when you write a single pattern, obviously, the string is parsed only once.

Is a single big regex more efficient than a bunch of smaller ones?

Answers (1)

Related Questions