Reputation: 743
I am using re.sub()
with some complicated patterns (created by the code) that can cause backtracking.
Is there any practical way to abort re.sub
(say pretend that pattern is not found, or raise an error) after a certain number of iterations in Python 2.6?
Sample (this is of course a dumb pattern, but it's created dynamically by a complex text processing engine):
>>>re.sub('[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*','*','ilililililililililililililililililililililililililililililililililil :x')
Upvotes: 2
Views: 293
Reputation: 35562
Other than analyzing the regex for the potential of catastrophic backtracking (a hard problem with an external regex) or using a different regex engine that does not allow backtracking, I think the only way is with a timeout of this nature:
import re
import signal
class Timeout(Exception):
pass
def try_one(pat,rep,s,t=3):
def timeout_handler(signum, frame):
raise Timeout()
old_handler = signal.signal(signal.SIGALRM, timeout_handler)
signal.alarm(t)
try:
ret=re.sub(pat, rep, s)
except Timeout:
print('"{}" timed out after {} seconds'.format(pat,t))
return None
finally:
signal.signal(signal.SIGALRM, old_handler)
signal.alarm(0)
return ret
try_one(r'^(.+?)\1+$', r'\1' ,"a" * 1000000 + "b")
Trying to replace a large repetitions of a single character (in this case one million 'a' characters) is a classic catastrophic regex failure. It would take tens of thousands of years to complete (at least with Python or Perl. Awk is different).
After 3 seconds of trying, it times out gracefully and prints:
"^(.+?)\1+$" timed out after 3 seconds
Upvotes: 5
Reputation: 14864
count
can help you here:
In [9]: re.sub ?
Type: function
Base Class: <type 'function'>
String Form:<function sub at 0x00AC7CF0>
Namespace: Interactive
File: c:\python27\lib\re.py
Definition: re.sub(pattern, repl, string, count=0, flags=0)
Docstring:
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the match object and must return
a replacement string to be used.
In [13]: a = "bbbbbbb"
In [14]: x = re.sub('b', 'a', a, count=3)
In [15]: x
Out[15]: 'aaabbbb'
Upvotes: -1