Reputation: 7994
A colleague and I are debating about the complexity of regular expressions.
I have n regular expressions that are a match if a string ends up in ".XX". For instance:
regex 1 matches if the string ends in .AAA
regex 2 matches if the string ends in .BB
regex 3 matches if the string ends in .C
etc. (n times)
Let's say I want to write a function that returns true
if an input string matches any of the regex above.
I could do a loop over the n regexes and return true
as soon as a regexp matches; in that case the complexity is linear in the number n of regexp.
Another method would be to create a single big regexp that would be:
regex A matches if the string ends in .AAA or in .BB or in .C (etc n times)
This time I have a single regexp.
Is this second method better from a complexity point of view? I think that the time to check this regexp is still linear in n. However, my colleague says that this regexp has constant complexity with regards to n. His point is that once the big regex is compiled, there is a huge switch generated with n labels and that branching on a switch is done in constant time (i.e., regardless of how many cases there are in the switch).
So, which answer is correct?
Upvotes: 3
Views: 116
Reputation:
Most regex engines create a trie when you use alternations, for each level of an alternation group.
I'd imagine it doesn't use a switch like statement, more like a linked
list of nodes.
For example if these are all letters, there is an artificial slot for each letter. So no matter how many words you have, on a letter basis, there is only one branch each (at that level).
".AAA or .BB or .C" would produce a regex \.(?:AAA|BB|C)$
; there is only a max of 4 comparisons on failure.
I don't think a switch statement is used.
Also, each time a new regex is entered there is an initialization overhead penalty (not big, but it can add up). So, its better to have a single regex that uses a trie, then to reenter it regularly.
As always, you could easily find out which one matched if you add
capture groups \.(?:(AAA)|(BB)|(C))$
. After the match, a simple boolean
flag check via the matcher will tell you which one matched.
Edit:
To clear things up a bit, some engines auto-generate a regex trie when it
encounters alternation groups, some don't.
Trie's are based on sorted nodal tree structures, with pointers to a next and child node. Where each node can point to another sorted list.
root
|
a --------- e ----- r
| | |
m --- n m o
| | | |
y n m b ----- g
| | | | |
\0 \0 a \0 e
| |
\0 r
|
\0
I think engines might implement a faster version called a ternary tree.
Not sure.
For a reference of performance, take a look at this regex
ASCII 175,000_word Mix A-Z Multi Lined
which is a 175,000 word dictionary.
This regex is a fully implemented trie of a dictionary, with
no more than 7 paths to find each word.
Despite it's size, it is tremendously fast.
It was generated with RegexFormat 7 Ternry Tree tool.
To test it, or just to play around with regex trie's,
grab the trial version and download the dictionary samples.
Open the sample and load its attached regex (from the toolbar),
click in the test window and press F3 again and again.
Upvotes: 4