mercador
mercador

Reputation: 743

python2 re.sub: abort catastrophic pattern on backtracking

I am using re.sub() with some complicated patterns (created by the code) that can cause backtracking.

Is there any practical way to abort re.sub (say pretend that pattern is not found, or raise an error) after a certain number of iterations in Python 2.6?

Sample (this is of course a dumb pattern, but it's created dynamically by a complex text processing engine):

>>>re.sub('[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*[i1l!|](?:[^i1l!|\\w]|[i1l!|])*[l1i!|](?:[^l1i!||\\w]|[l1i!|])*','*','ilililililililililililililililililililililililililililililililililil :x')

Upvotes: 2

Views: 293

Answers (2)

the wolf
the wolf

Reputation: 35562

Other than analyzing the regex for the potential of catastrophic backtracking (a hard problem with an external regex) or using a different regex engine that does not allow backtracking, I think the only way is with a timeout of this nature:

import re
import signal

class Timeout(Exception): 
    pass 

def try_one(pat,rep,s,t=3):
    def timeout_handler(signum, frame):
        raise Timeout()

    old_handler = signal.signal(signal.SIGALRM, timeout_handler) 
    signal.alarm(t) 

    try: 
        ret=re.sub(pat, rep, s)

    except Timeout:
        print('"{}" timed out after {} seconds'.format(pat,t))
        return None

    finally:
        signal.signal(signal.SIGALRM, old_handler) 

    signal.alarm(0)
    return ret

try_one(r'^(.+?)\1+$', r'\1' ,"a" * 1000000 + "b")

Trying to replace a large repetitions of a single character (in this case one million 'a' characters) is a classic catastrophic regex failure. It would take tens of thousands of years to complete (at least with Python or Perl. Awk is different).

After 3 seconds of trying, it times out gracefully and prints:

"^(.+?)\1+$" timed out after 3 seconds

Upvotes: 5

avasal
avasal

Reputation: 14864

count can help you here:

In [9]: re.sub ?
Type:       function
Base Class: <type 'function'>
String Form:<function sub at 0x00AC7CF0>
Namespace:  Interactive
File:       c:\python27\lib\re.py
Definition: re.sub(pattern, repl, string, count=0, flags=0)
Docstring:
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl.  repl can be either a string or a callable;
if a string, backslash escapes in it are processed.  If it is
a callable, it's passed the match object and must return
a replacement string to be used.


In [13]: a = "bbbbbbb"

In [14]: x = re.sub('b', 'a', a, count=3)

In [15]: x
Out[15]: 'aaabbbb'

Upvotes: -1

Related Questions