Yogev Shitrit

Reputation: 21

Regex taking too much time

I created two regexes (re1 and re2). Running findall with the first regex (re1) takes about 30 seconds, while running it with the second regex (re2) takes about 1 second.

Can you help me find the difference, or what is causing this problem? Thanks!

import re
data = b'000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'
re1 = b'.*00.*00.*00.*00.*76.*62.*55.*75.*'
re2 = b'.*aa.*00.*00.*6f.*63.*20.*6d.*75.*'
reg = re.compile(b'^%s$' % re1, re.RegexFlag.M)
results = len(reg.findall(data))
print(results)

Upvotes: 0

Views: 555

Answers (1)

Jérôme Richard

Reputation: 50278

The problem comes from backtracking in CPython's regex engine (as emphasized by @Thefourthbird). More specifically, it comes from the leading and trailing .*, which are not needed if data does not contain newline characters. In that case, findall will either find exactly one match (the whole of data, thanks to the .*) or nothing, so you do not need findall: a search is enough. Moreover, wrapping the pattern in ^ and $ is pointless when it is already prefixed and suffixed with .*. The following code should produce the same result but is 20 times faster on my machine (still not very fast given the input size).

import re
data = b'000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'
# Drop the leading/trailing .* and the ^/$ anchors: they only add backtracking.
re1 = b'00.*00.*00.*00.*76.*62.*55.*75'
re2 = b'aa.*00.*00.*6f.*63.*20.*6d.*75'
reg = re.compile(re1, re.RegexFlag.M)
# A single search is enough: the pattern either matches or it does not.
results = 1 if reg.search(data) else 0
print(results)

If data contains newline characters, things are a bit more complex, as only the line containing the pattern should be matched. Indeed, . does not match a newline character (since RegexFlag.DOTALL is not set). One solution is to split data into lines first and then apply the regex to each line. Another solution is to use the above code, replace the search with a finditer to get the location of each match, and then find the beginning and end of the line containing each match.
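The line-splitting approach can be sketched as follows. The data bytes here are a made-up example with newlines (not from the question); the pattern is the re2 from the answer's code:

```python
import re

# Hypothetical multi-line input: only the last line contains the full pattern.
data = b'aa0000\n' + b'0' * 50 + b'\naa00006f63206d75\n'

pattern = re.compile(b'aa.*00.*00.*6f.*63.*20.*6d.*75')

# Apply the regex to each line separately; . never crosses a newline anyway,
# and per-line searches avoid the pathological backtracking over the whole input.
matching_lines = [line for line in data.splitlines() if pattern.search(line)]
print(len(matching_lines))
```

This reports how many lines match, which plays the role of the findall count in the original code.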

If you want to know more about why backtracking causes such a slow execution you can look at this post: Why can regular expressions have an exponential running time?.

Note that re1 is quite a pathological regex for this data. There are much faster regex engines for computing it. Google's RE2 is a linear-time regex engine that should be much faster in such pathological cases (though generally not on benign inputs). There is also Intel's Hyperscan regex engine, which is generally very fast compared to other engines (although a bit less user-friendly).

Upvotes: 1
