Reputation: 29591
Regarding regex (specifically Python's re module): if we ignore how the expression itself is written, is the length of the text the only factor that determines how long it takes to process the document? Or are there other factors (such as how the text is structured) that also play an important role?
Upvotes: 3
Views: 2462
Reputation: 2860
Refactoring a regex into a multi-level trie accounts for about 95% of the 800% increase in performance. The other 5% comes from factoring the expression not only to facilitate the trie but also to enhance it, which can give a possible 30x performance boost.
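As a rough illustration of the idea only (a minimal sketch; the trie_regex helper below is hypothetical and not the implementation behind the figures above): a flat alternation such as foo|foobar|food|form can be factored into a trie-shaped pattern, so the engine decides each character once instead of re-trying every alternative at each position.

import re

def trie_regex(words):
    # Build a character trie; '' marks the end of a word.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[''] = {}

    # Emit a pattern for a node: sibling branches become (?:...|...),
    # and an end-of-word marker makes the remainder optional with '?'.
    def to_pattern(node):
        if '' in node and len(node) == 1:
            return ''
        optional = False
        branches = []
        for ch, child in sorted(node.items()):
            if ch == '':
                optional = True
            else:
                branches.append(re.escape(ch) + to_pattern(child))
        if len(branches) == 1 and not optional:
            return branches[0]
        pattern = '(?:' + '|'.join(branches) + ')'
        return pattern + '?' if optional else pattern

    return to_pattern(trie)

pattern = trie_regex(['foo', 'foobar', 'food', 'form'])
print(pattern)                                    # fo(?:o(?:bar|d)?|rm)
print(re.findall(pattern, 'foobar, food and form'))

Whether this pays off depends on the input: the factoring mainly helps when a naive alternation would force the engine to re-scan the same prefix for many alternatives at every starting position.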
Upvotes: 0
Reputation: 839224
Both the length of the text and its contents are important.
As an example, the regular expression a+b will fail quickly on a string containing one million b characters, but will take much longer to fail on a string containing one million a characters. This is because far more backtracking is required in the second case.
import timeit

x = "re.search('a+b', s)"
# 10,000 'a's: a+ consumes the a's, then backtracks at every start position looking for a 'b'
print(timeit.timeit(x, "import re; s = 'a' * 10000", number=10))
# 10,000 'b's: no 'a' to start a match, so each attempt fails immediately
print(timeit.timeit(x, "import re; s = 'b' * 10000", number=10))
Results:
6.85791902323
0.00795443275612
Upvotes: 4
Reputation: 336478
One important consideration can also be whether the text actually matches the regular expression. Take (as a contrived example) the regex (x+x+)+y from this regex tutorial.
When applied to xxxxxxxxxxy it matches, taking the regex engine 7 steps. When applied to xxxxxxxxxx, it fails (of course), but it takes the engine 2558 steps to arrive at that conclusion. For xxxxxxxxxxxxxxy vs. xxxxxxxxxxxxxx it's already 7 vs. 40958 steps, and so on, exponentially.
This happens especially easily with nested repetitions, or with regexes where the same text can be matched by two or more different parts of the expression, forcing the engine to try all permutations before it can declare failure. This is known as catastrophic backtracking.
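To see that growth concretely in Python's re, here is a small illustrative timing sketch (absolute numbers will vary by machine): the matching case stays essentially flat, while the failing case roughly doubles in cost with every extra x.

import re
import timeit

pattern = re.compile(r'(x+x+)+y')

for n in (10, 14, 18, 20):
    # Matching input ends in 'y'; failing input is the same run of x's without it.
    t_match = timeit.timeit(lambda: pattern.search('x' * n + 'y'), number=1)
    t_fail = timeit.timeit(lambda: pattern.search('x' * n), number=1)
    print(f'n={n}  match: {t_match:.6f}s  fail: {t_fail:.6f}s')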
Upvotes: 6