q2ven
q2ven

Reputation: 187

greedy and greedy in re module

What is the difference between /(a+)+c/ and /(a+)c/ in re module? I checked how long it takes to be executed. There was a big difference. I want to know why there is a big difference. I entered this sample into form of Online regex tester but I did not understand what was happening. Please tell me in words.

import re
from time import clock    
def test(f, *args, *kargs):
  start = now()
  f(*args, *kargs)
  print("The function %s lasted: %f" % (f.__name__, now() - start))

def catastrophic1(n):
  print("Testing with %d characters" % n)
  pat = re.compile("(a+)+c")
  text = "%s" % 'a' * n
  pat.search(text)

def catastrophic2(n):
  print("Testing with %d characters" % n)
  pat = re.compile("(a+)c")
  text = "%s" % 'a' * n
  pat.search(text)

for n in range(13, 20):
  test(catastrophic1, n)
for n in range(13, 20):
  test(catastrophic2, n)

Upvotes: 1

Views: 80

Answers (2)

Kasravnd
Kasravnd

Reputation: 107297

The answer of this problem is depend on you that can understand how python regex engine interpret the + and what it does for match a pattern.

Actually python regex engine uses Traditional NFA for its regex matches and based on essence of an NFA engine that :

it considers each subexpression or component in turn, and whenever it needs to decide between two equally viable options, it selects one and remembers the other to return to later if need be. Situations where it has to decide among courses of action include anything with a quantifier (decide whether to try another match), and alternation (decide which alter native to try, and which to leave for later). Whichever course of action is attempted, if it’s successful and the rest of the regex is also successful, the match is finished. If anything in the rest of the regex eventually causes failure, the regex engine knows it can backtrack to where it chose the first option, and can continue with the match by trying the other option. This way, it eventually tries all possible permutations of the regex (or at least as many as needed until a match is found).*

And in addition of the preceding processes for a + regex engine will try all combinations of the preceding pattern from length 1 or more so for a pattern like (a+)+c we have a exponential number of tries!and it will gobble up a lot of time!


* Mastering Regular Expressions, Third Edition, by Jeffrey E. F. Friedl!

Upvotes: 1

codebox
codebox

Reputation: 20254

Both expressions match the same thing, namely one or more as followed by one c.

The expression (a+)+c is more time-consuming for the regex engine to process because there are more ways that a string of as could be matched by this expression.

For example, using the second expression, the string aaaaaa could be broken down into the following groups

(a)(a)(a)(a)(a)(a) # here (a+) matches a single 'a'
(aa)(aa)           # here (a+) matches 'aa'
(aaa)(aaa)         # here (a+) matches 'aaa'

Upvotes: 1

Related Questions