HammerMeetNail
HammerMeetNail

Reputation: 350

Python Regex slower than expected

I read a cool article on how to avoid creating slow regular expressions. Generally speaking it looks like the longer and more explicit and regex is the faster it will complete. A greedy regex can be exponentially slower.

I thought I would test this out by measuring the time it takes to complete a more complex/explicit statement with a less complex/greedy statement. For the most part everything seems to hold true, but I have one greedy statement that clocked in slower. Here are two examples:

import re
from timeit import timeit

# This works as expected, the explicit is faster than the greedy.
# http_x_real_ip explicit 
print(timeit(setup="import re", stmt='''r = re.search(r'(\d{1,3}\.\d{1,3}.\d{1,3}.\d{1,3})', '192.168.1.1 999.999.999.999')''', number=1000000))
1.159849308001867

# http_x_real_ip greedy
print(timeit(setup="import re", stmt='''r = re.search(r'((?:\d{1,3}\.){3}\d{1,3})', '192.168.1.1 999.999.999.999')''', number=1000000))
1.7421739230003368

# This does not work as expected, greedy is faster.
# time_local explicit
print(timeit(setup="import re", stmt='''r = re.search(r'(\d{1,2}/\w{3}/[2][0]\d{2}:\d{2}:\d{2}:\d{2}\s[+][0]{4})', "[23/Jun/2015:11:10:57 +0000]")''', number=1000000))
1.248802040994633

# time_local greedy
print(timeit(setup="import re", stmt='''r = re.search(r'\[(.*)\]', "[23/Jun/2015:11:10:57 +0000]")''', number=1000000))
1.0256699790043058

Is the local_time explict regex just poorly written?

Upvotes: 10

Views: 307

Answers (2)

Alex Huszagh
Alex Huszagh

Reputation: 14614

You also are not using the re.compile feature of Python regexes, meaning your search time also includes time for the re module to compile the regex at each iteration.

>>> print(timeit(setup="import re", stmt='''r = re.search(r'(\d{1,3}\.\d{1,3}.\d{1,3}.\d{1,3})', '192.168.1.1 999.999.999.999')''', number=1000000))
0.73820400238
>>> print(timeit(setup="import re; regex = re.compile(r'(\d{1,3}\.\d{1,3}.\d{1,3}.\d{1,3})')", stmt='''r = regex.search('192.168.1.1 999.999.999.999')''', number=1000000))
0.271140813828
>>> print(timeit(setup="import re; regex = re.compile(r'((?:\d{1,3}\.){3}\d{1,3})')", stmt='''r = regex.search('192.168.1.1 999.999.999.999')''', number=1000000))
0.31952214241
>>> print(timeit(setup="import re; regex = re.compile(r'(\d{1,2}/\w{3}/[2][0]\d{2}:\d{2}:\d{2}:\d{2}\s[+][0]{4})')", stmt='''r = regex.search("[23/Jun/2015:11:10:57 +0000]")''', number=1000000))
0.371844053268
>>> 

The difference between the greedy and non-greedy regex here is actually much closer to expected when you pre-compile. The rest of the explanation goes to backtracking.

We can see that your tests speed up almost by a factor of 3 if you pre-compile your regexes for a large number of iterations.

This answer is meant to complement @mescalinum's answer, but for a large number of regexes, you should really be compiling the regexes ahead of time for a fair comparison.

Upvotes: 2

fferri
fferri

Reputation: 18940

The more a regular expression has to backtrack, the slower it is.

This might not hold for very small input data. However, who would care about the performance on small data? :D


This topic is well covered in this article:

Also there are interesting contributions in this question:

Upvotes: 2

Related Questions