goh

Reputation: 29591

Python regex speed

Regarding regexes (specifically Python's re module): if we ignore how the expression itself is written, is the length of the text the only factor in the time required to process a document? Or are there other factors (like how the text is structured) that play an important role too?

Upvotes: 3

Views: 2462

Answers (3)

sln

Reputation: 2860

Refactoring a regex into a multi-level trie covers 95% of a possible 800% increase in performance. The other 5% involves factoring the expression not only to facilitate the trie but to enhance it, which can give up to a 30x performance boost.
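
As a rough sketch of what such a refactoring looks like (the word list, patterns, and timing setup here are illustrative assumptions, not sln's benchmark): a flat alternation makes the engine re-test the shared prefix for every branch, while a trie-style factoring matches each common prefix once.

import re
import timeit

words = ["foobar", "foobaz", "foobie", "fooqux"]

# Flat alternation: at each failing position the engine tries every
# branch in turn, re-scanning the shared "foo"/"foob" prefix each time.
flat = re.compile("|".join(words))

# Trie-style factoring of the same four words: shared prefixes are
# matched once, so a non-matching position is rejected in a single pass.
trie = re.compile(r"foo(?:b(?:a[rz]|ie)|qux)")

text = "foobaz fooqux no match here " * 1000

print(timeit.timeit(lambda: flat.findall(text), number=100))
print(timeit.timeit(lambda: trie.findall(text), number=100))

Both patterns match exactly the same four words, so the two findall calls return identical results; only the amount of work per position differs.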

Upvotes: 0

Mark Byers

Reputation: 839224

Both the length of the text and its contents are important.

As an example, the regular expression a+b fails quickly on a string containing ten thousand bs, but far more slowly on a string containing ten thousand as. This is because much more backtracking is required in the second case: at every starting position, a+ greedily consumes all the remaining as, then gives them back one at a time while looking for a b that never comes.

import timeit
x = "re.search('a+b', s)"
# Ten thousand a's: a+ matches greedily, then backtracks at every position.
print(timeit.timeit(x, "import re; s = 'a' * 10000", number=10))
# Ten thousand b's: a+ fails immediately at every position.
print(timeit.timeit(x, "import re; s = 'b' * 10000", number=10))

Results:

6.85791902323
0.00795443275612

Upvotes: 4

Tim Pietzcker

Reputation: 336478

Another important consideration is whether the text actually matches the regular expression at all. Take (as a contrived example) the regex (x+x+)+y from this regex tutorial.

When applied to xxxxxxxxxxy it matches, taking the regex engine 7 steps. When applied to xxxxxxxxxx, it fails (of course), but it takes the engine 2558 steps to arrive at this conclusion.

For xxxxxxxxxxxxxxy vs. xxxxxxxxxxxxxx it's already 7 vs. 40958 steps, and the gap keeps growing exponentially from there.

This happens especially easily with nested repetitions, or with regexes where the same text can be matched by two or more different parts of the expression, forcing the engine to try every permutation before it can declare failure. This is called catastrophic backtracking.
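
One way to see this blow-up for yourself is to time the failing case at growing input lengths. This is a minimal sketch using the same (x+x+)+y pattern; the exact times depend on your machine and Python version.

import re
import timeit

pattern = re.compile(r"(x+x+)+y")

# Strings of n x's with no trailing y: each extra x roughly doubles the
# number of ways the engine can split the run between the two x+ groups,
# so the time to report "no match" grows exponentially with n.
for n in range(16, 24):
    t = timeit.timeit(lambda: pattern.search("x" * n), number=1)
    print(n, f"{t:.4f} s")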

Upvotes: 6
