joooohnli

Reputation: 171

Python non-greedy regular expression is not exactly what I expected

string: XXaaaXXbbbXXcccXXdddOO

I want to match the minimal string that begins with 'XX' and ends with 'OO'.

So I wrote the non-greedy regex: r'XX.*?OO'

>>> import re
>>> str = 'XXaaaXXbbbXXcccXXdddOO'
>>> re.findall(r'XX.*?OO', str)
['XXaaaXXbbbXXcccXXdddOO']

I thought it would return ['XXdddOO'], but it was so 'greedy'.

Then I realized I was mistaken: the pattern first matches the leftmost 'XX', and only after that does the quantifier behave non-greedily.

But I would still like to know how I can get my result ['XXdddOO'] directly. Any reply is appreciated.

Update: the key point is not really non-greediness itself, but what I took non-greedy to mean: that it should match as few characters as possible between the left delimiter (XX) and the right delimiter (OO). In fact, of course, the string is processed from left to right.

Upvotes: 2

Views: 776

Answers (4)

Casimir et Hippolyte

Reputation: 89557

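The behaviour occurs because the string is processed from left to right: the engine starts at the first XX it finds and only then looks for the nearest OO. One way to avoid the problem is to use a negated character class (together with lookaheads) so that the part between XX and OO cannot itself contain XX: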

XX(?:(?=([^XO]+|O(?!O)|X(?!X)))\1)+OO
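A minimal Python sketch of how this pattern behaves on the string from the question. One caveat, assuming Python's re module: the pattern contains a capturing group, so re.findall returns the contents of that group rather than the whole match; re.finditer with group(0) gives the full match. The variable names are only illustrative.

```python
import re

text = 'XXaaaXXbbbXXcccXXdddOO'

# The lookahead captures one "safe" chunk and \1 then consumes it,
# so neither 'XX' nor 'OO' can appear between the opening XX and the closing OO.
pattern = re.compile(r'XX(?:(?=([^XO]+|O(?!O)|X(?!X)))\1)+OO')

# group(0) is the whole match.
full_matches = [m.group(0) for m in pattern.finditer(text)]
print(full_matches)

# findall returns group 1 (the last chunk captured), not the whole match.
group_only = pattern.findall(text)
print(group_only)
```

If only the full match is needed, finditer (or re.search followed by group(0)) is the safer call here.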

Upvotes: 1

mont29

Reputation: 398

Indeed, the issue is not greedy vs. non-greedy matching. The solution suggested by @devnull should work, provided you want to avoid even a single X between your XX and OO groups.

Otherwise, you'll have to use a lookahead (i.e. a piece of regex that scans ahead in the string and checks whether it matches there, without actually consuming any characters). Something like this:

re.findall(r'XX(?:.(?!XX))*?OO', str)

With this negative lookahead, you match (non-greedily) any character (.) that is not followed by XX.
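As a quick check, here is a small sketch that runs the lookahead pattern on the question's string and on a second, made-up string containing two XX...OO blocks. The second string and the variable names are assumptions for illustration only.

```python
import re

question_text = 'XXaaaXXbbbXXcccXXdddOO'
two_blocks = 'XXaaaXXbbbOOXXcccXXdddOO'  # hypothetical input with two closing OO

# Each consumed character must not be immediately followed by 'XX'.
pattern = r'XX(?:.(?!XX))*?OO'

single_result = re.findall(pattern, question_text)
multi_result = re.findall(pattern, two_blocks)

print(single_result)
print(multi_result)
```

Because the group is non-capturing, re.findall returns the whole matches here.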

Upvotes: 2

Toto

Reputation: 91430

How about:

.*(XX.*?OO)

The match will be in group 1.
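A short sketch of this approach in Python, assuming the string from the question. The leading greedy .* first consumes the whole string and then backtracks, so the group ends up at the last XX that can still be followed by OO. The variable names are illustrative.

```python
import re

text = 'XXaaaXXbbbXXcccXXdddOO'

# Greedy .* pushes the captured group as far right as possible.
match = re.search(r'.*(XX.*?OO)', text)
last_block = match.group(1) if match else None
print(last_block)
```

Note that this yields only one result, the last XX...OO block, since the leading .* swallows everything before it.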

Upvotes: 5

Robin

Reputation: 9644

Regexes work from left to right: non-greedy means that it will match XXaaaXXdddOO and not XXaaaXXdddOOiiiOO. If your data structure is that fixed, you could use:

XX[a-z]{3}OO

to select all patterns like XXiiiOO. It can be adjusted to fit your needs: XX[^X]+?OO, for instance, selects everything from the last XX before an OO up to that OO. For example, in XXiiiXXdddFFcccOOlll it would match XXdddFFcccOO.
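Both patterns can be tried out with a small Python sketch like the one below, using the strings mentioned in the thread. The variable names are illustrative, and note that [^X] rejects any single X between the delimiters, not just a double XX.

```python
import re

# Fixed structure: exactly three lowercase letters between XX and OO.
fixed_result = re.findall(r'XX[a-z]{3}OO', 'XXaaaXXbbbXXcccXXdddOO')
print(fixed_result)

# Negated class: no X at all is allowed between XX and OO.
negated_result = re.findall(r'XX[^X]+?OO', 'XXiiiXXdddFFcccOOlll')
print(negated_result)
```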

Upvotes: 2
