Reputation: 421

Regex for matching large pattern

I am a newbie to regex, I am trying to match some pattern but it works for fewer len pattern but it gets stuck for large pattern(looks like some Catastrophic Backtracking issue).

Below is my string,

world0 world1 world2 world3 world4 world5 world6 world7 world8 world9 world10
world11 world12 world13 world14 world15 world16 world17 world18 world19 world20
world21 world22 world23 world24 world25 world26 world27 world28 world29 world30
world31 world32 world33 world34 world35 world36 world37 world38 world39 world40
world41 world42 world43 world44 world45 world46 world47 world48 world49 world50
world51 world52 world53 world54 world55 world56 world57 world58 world59 world60
world61 world62 world63 world64 world65 world66 world67 world68
world69 world70 world71 world72 world73 world74 world75 world76 world77 world78
world79 world80 world81 world82 world83 world84 world85 world86 world87 world88
world89 world90 world91 world92 world93 world94 world95 world96 world97 world98
world99 world0 world1 world2 world3 world4 world5 world6 world7 world8 world9
world10 world11 world12 world13 world14 world15 world16 world17 world18 world19
world20 world21 world22 world23 world24 world25 world26 world27 world28 world29
world30 world31 world32 world33 world34 world35 world36 world37 world38 world39
world40 world41 world42 world43 world44 world45 world46 world47 world48 world49
world50 world51 world52 world53 world54 world55 world56 world57 world58 world59
world60 world61 world62 world63 world64 world65 world66 world67 world68 world69
world70 world71 world72 world73 world74 world75 world76 world77 world78 world79
world80 world81 world82 world83 world84 world85 world86 world87 world88 world89
world90 world91 world92 world93 world94 world95 world96 world97 world98

Now my match pattern is a list of string lets say match_list, my expected output is, it should match sub-string from above that has all string defined in match_list string

Small list = ["world0","world1", "world2"]

I tried the following pattern

(?=((\b(?:world0|world1|world2)\b[\w\s]*?){3}))

The above one works fine and matched output is correct which I expect,

[0-20]  `world0 world1 world2`

[7-796] `world1 world2 world3 world4 world5 world6 world7 world8 world9 world10
world11 world12 world13 world14 world15 world16 world17 world18 world19 world20
world21 world22 world23 world24 world25 world26 world27 world28 world29 world30
world31 world32 world33 world34 world35 world36 world37 world38 world39 world40
world41 world42 world43 world44 world45 world46 world47 world48 world49 world50
world51 world52 world53 world54 world55 world56 world57 world58 world59 world60
world61 world62 world63 world64 world65 world66 world67 world68 world69 world70
world71 world72 world73 world74 world75 world76 world77 world78 world79 world80
world81 world82 world83 world84 world85 world86 world87 world88 world89 
world90 world91 world92 world93 world94 world95 world96 world97 world98 world99 world0`

 [14-803] `world2 world3 world4 world5 world6 world7 world8 world9 world10 world11
 world12 world13 world14 world15 world16 world17 world18 world19 world20 world21
 world22 world23 world24 world25 world26 world27 world28 world29 world30 world31
 world32 world33 world34 world35 world36 world37 world38 world39 world40 world41
 world42 world43 world44 world45 world46 world47 world48 world49 world50 world51
 world52 world53 world54 world55 world56 world57 world58 world59 world60 world61
 world62 world63 world64 world65 world66 world67 world68 world69 world70 world71
 world72 world73 world74 world75 world76 world77 world78 world79 world80 world81
 world82 world83 world84 world85 world86 world87 world88 world89 world90 world91
 world92 world93 world94 world95 world96 world97 world98 world99 world0 world1`

[790-810]   `world0 world1 world2`

But for large list = ['world0', 'world1', 'world2', 'world3', 'world4', 'world5', 'world6', 'world7', 'world8', 'world9', 'world10', 'world11', 'world12', 'world13', 'world14', 'world15', 'world16', 'world17', 'world18', 'world19', 'world20', 'world21', 'world22', 'world23', 'world24', 'world25', 'world26', 'world27', 'world28', 'world29', 'world30', 'world31', 'world32', 'world33', 'world34', 'world35', 'world36', 'world37', 'world38', 'world39', 'world40', 'world41', 'world42', 'world43', 'world44', 'world45', 'world46', 'world47', 'world48', 'world49']

Tried following pattern

(?=((\b(?:world0|world1|world2|world3|world4|world5|world6|world7|world8|world9|wor ld10|world11|world12|world13|world14|world15|world16|world17|world18|world19|world20|world21|world22|world23|world24|world25|world26|world27|world28|world29|world30|world31|world32|world33|world34|world35|world36|world37|world38|world39|world40|world40|world41|world42|world43|world44|world45|world46|world47|world48|world49|world50)\b[\w\s]*?){49}))

It is throwing me a catastrophic backtracking error. Could you someone tell what am doing wrong or what would be the best way to do it ?

Upvotes: 0

Answers (2)

Casimir et Hippolyte

Reputation: 89567

First thing, your pattern is wrong since it matches world0 world0 world0.

This problem can't be solved only by regex. If I write a pattern (for the regex module) like:

word_list = ['world0', 'world1', 'world2']
p = regex.compile(r'''
    \m (\L<words>)
    \W++ (?>\w+\W+)*? (?!\g{-1})
    (\L<words>) 
    \W++ (*SKIP) (?>\w+\W+)*? (?!\g{-1}|\g{-2})
    (\L<words>) \M 
  ''', regex.VERBOSE, words=word_list)

for m in p.finditer(text, overlapped=True):
     print(m.group(0))

that searches for only three items in your example text, I obtain something complicated (and not efficient even with an optimisation effort), difficult to extend for more items and that will probably crash with more text or more items.

An other possible approach consists to only search words from the list and to create text excerpts in a generator when all the words have been found:

import regex
from collections import deque

data = '''He moved on as he spoke, and the Dormouse followed him: the March Hare moved into the Dormouse’s place, and Alice rather unwillingly took the place of the March Hare. The Hatter was the only one who got any advantage from the change: and Alice was a good deal worse off than before, as the March Hare had just upset the milk-jug into his plate.
Alice did not wish to offend the Dormouse again, so she began very cautiously: `But I don’t understand. Where did they draw the treacle from?’
`You can draw water out of a water-well,’ said the Hatter; `so I should think you could draw treacle out of a treacle-well–eh, stupid?’
`But they were IN the well,’ Alice said to the Dormouse, not choosing to notice this last remark.
`Of course they were’, said the Dormouse; `–well in.’
This answer so confused poor Alice, that she let the Dormouse go on for some time without interrupting it.
`They were learning to draw,’ the Dormouse went on, yawning and rubbing its eyes, for it was getting very sleepy; `and they drew all manner of things–everything that begins with an M–‘
`Why with an M?’ said Alice.
`Why not?’ said the March Hare.
Alice was silent.
The Dormouse had closed its eyes by this time, and was going off into a doze; but, on being pinched by the Hatter, it woke up again with a little shriek, and went on: `–that begins with an M, such as mouse-traps, and the moon, and memory, and muchness– you know you say things are “much of a muchness”–did you ever see such a thing as a drawing of a muchness?’
`Really, now you ask me,’ said Alice, very much confused, `I don’t think–‘
`Then you shouldn’t talk,’ said the Hatter.'''

word_list = ('Dormouse', 'Hatter', 'Alice')

def match_gen(word_list, text):
    p = regex.compile(r'\m\L<words>\M', words=word_list)
    d = deque()
    occlist = [0]*len(word_list)   

    for m in p.finditer(text):
        windex = word_list.index(m.group(0))
        d.append((windex, m.start()))
        occlist[windex] += 1

        while not(0 in occlist):
            elt = d.popleft()
            occlist[elt[0]] -= 1
            yield [elt[1],m.end()],text[elt[1]:m.end()]

for x in match_gen(word_list, data):
    print(x)

The advantages are that there's no more risks of catastrophic backtracking and the few memory usage.

Note: I choose to use the regex module instead of the re module because it has more handy features like the named list, the overlapped flag or the word boundaries \m and \M, but you can do the same with the re module (but you need to use the (?=(...)) for overlapped matches, \b instead of \m and \M, and '|'.join(word_list) to build the alternation).

Note2: If your word list is too long, you can use the same way but instead of using an alternation as pattern (i.e. \L<words>), use only \w+ and check for each match if it is in the list. You can replace the beginning of the previous code like this:

def match_gen(word_list, text):
    p = regex.compile(r'\w+')
    d = deque()
    occlist = [0]*len(word_list)   

    for m in filter(lambda x: x.group(0) in word_list, p.finditer(text)):

Upvotes: 1

Martin Konecny

Reputation: 59621

It seems for your last pattern that you want to match all worlds with numeric suffix than 50. So instead of

(?=((\b(?:world0|world1|world2|world3|world4|world5|world6|world7|world8|world9|wor ld10|world11|world12|world13|world14|world15|world16|world17|world18|world19|world20|world21|world22|world23|world24|world25|world26|world27|world28|world29|world30|world31|world32|world33|world34|world35|world36|world37|world38|world39|world40|world40|world41|world42|world43|world44|world45|world46|world47|world48|world49|world50)\b[\w\s]*?){49}))

Why not the following (match all values from 0-49 or 50):

(?=((\b(?:world([0-4][0-9]?|50))\b[\w\s]*?){3}))

And here is my attempt to clean up your regex based on your description

it should match sub-string from above that has all string defined in match_list string

regex = r'\bworld([0-4][0-9]?|50)\b'
matches = re.findall(regex, "world1 world2 world50 world60")
print matches  # ['world1', 'world2', 'world50']

Upvotes: 0

Regex for matching large pattern

Answers (2)

Related Questions