Jan

Reputation: 43169

Match in specific locations

This is a follow-up to this question (not asked by me). While trying to answer it, I ran into a couple of problems.

Consider the string strings123[abc789<span>123</span>def<span>456</span>000]strings456. How would one match the digits inside the square brackets that are not surrounded by span tags, in Python (using the newer regex module)?
In the example string, this would be 789 and 000.


I was fiddling around with \G (demo):

(?:\G(?!\A)|\[)
[^\d\]]*
\K
\d+

and (*SKIP)(*FAIL) (demo):

<span>.*?</span>(*SKIP)(*FAIL)
|
\d+

But I was unable to combine the two patterns:

<span>.*?</span>(*SKIP)(*FAIL)
|
(?:
    (?:\G(?!\A)|\[)
    [^\d\]]*
    (\d+)
    [^\d\]]*
    \K
)

How can this be done?

Upvotes: 3

Views: 75

Answers (2)

Wiktor Stribiżew

Reputation: 627087

One of the things I like about the PyPI regex module is that it supports infinite-width lookbehind:

  • Variable-length lookbehind

       A lookbehind can match a variable-length string.

>>> import regex
>>> s = 'strings123[abc789<span>123</span>def<span>456</span>000]strings456'
>>> rx = r'(?<=\[[^][]*)(?:<span>[^<]*</span>(*SKIP)(?!)|\d+)(?=[^][]*])'
>>> regex.findall(rx, s)
['789', '000']

Pattern details:

  • (?<=\[[^][]*) - immediately to the left of the current location, there must be a [ followed by zero or more chars other than ] and [
  • (?: - start of a non-capturing group
    • <span>[^<]*</span>(*SKIP)(?!) - match <span>, then 0+ chars other than < (the negated character class [^<]*), then </span>; then discard the match while staying at its end position and go on to look for the next match
    • | - or
    • \d+ - 1+ digits
  • ) - end of the non-capturing group
  • (?=[^][]*]) - immediately to the right of the current location, there must be zero or more chars other than ] and [ followed by a ].
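
For anyone limited to the standard library re module, which supports neither variable-length lookbehind nor (*SKIP), the same "consume the span, keep the digits" idea can be roughly approximated in two passes. This is only a sketch and not the pattern above: the function name digits_outside_spans is made up for illustration, and it assumes the brackets are not nested and the span contents contain no <.

```python
import re

def digits_outside_spans(text):
    """Return digit runs inside [...] that are not wrapped in <span> tags."""
    results = []
    # Pass 1: grab the contents of each non-nested [...] block.
    for segment in re.findall(r'\[([^\[\]]*)\]', text):
        # Pass 2: the first alternative consumes a whole span (no capture),
        # the second alternative captures digits that sit outside any span.
        for m in re.finditer(r'<span>[^<]*</span>|(\d+)', segment):
            if m.group(1) is not None:
                results.append(m.group(1))
    return results

s = 'strings123[abc789<span>123</span>def<span>456</span>000]strings456'
print(digits_outside_spans(s))
```

Because the span alternative comes first and consumes the whole element, the digits inside it are never offered to \d+, which plays the role that (*SKIP)(?!) plays in the single-pattern version.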

Upvotes: 3

Rahul

Reputation: 2748

I thought of the following algorithm.

  1. Search for the square brackets together with their contents and store the result in a variable. The regex would be \[[^]]*\].

  2. Next, find the <span> elements and replace each one with - to simplify the next step. The regex would be (<span>.*?</span>).

  3. You are now left with the contents of the square brackets minus whatever was inside the <span> tags. Simply search with \d+ to match the digits.
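
The three steps above could be sketched as follows with the standard library re module. The function name digits_via_replacement is hypothetical, and the sketch assumes the brackets are not nested and the spans are well formed.

```python
import re

def digits_via_replacement(text):
    """Follow the three steps: isolate brackets, blank out spans, find digits."""
    digits = []
    # Step 1: find each [...] block, brackets included.
    for block in re.findall(r'\[[^\]]*\]', text):
        # Step 2: replace every <span>...</span> element with a dash.
        cleaned = re.sub(r'<span>.*?</span>', '-', block)
        # Step 3: whatever digits remain were outside the span tags.
        digits.extend(re.findall(r'\d+', cleaned))
    return digits

s = 'strings123[abc789<span>123</span>def<span>456</span>000]strings456'
print(digits_via_replacement(s))
```

Replacing with - rather than an empty string keeps digits on either side of a span from merging into one number.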

Upvotes: 1
