user1718097

Reputation: 4292

Regex with negative look behind in Python

I have a series of free-text comments in a Pandas dataframe. I want to be able to identify those fields that match a given regex that includes a negative look behind. As a trivial example, I have fields such as the following:

frogs seen
green frog seen
no frogs seen
no green frogs seen
frogs not seen
green frogs not seen

I only want to identify those lines where frogs have been seen. In the real dataset, there may be lots of other text included, and the phrases shown are contained within a larger text string. The regex I came up with is the following:

(?<!no\s)(?:(?:green\s)?frogs?\s)(?!not\s)(?:seen)?

This almost works. It matches 'frogs seen' and 'green frog seen' as expected. It also does NOT match 'no frogs seen', 'frogs not seen' and 'green frogs not seen' which is exactly what is wanted. However, in the phrase 'no green frogs seen', the regex matches the text 'frogs seen'.
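To make the behaviour described above easy to reproduce, here is a minimal sketch that runs the pattern from the question over the six sample phrases with `re.search` and records what, if anything, was matched. The names `pattern`, `phrases` and `results` are only illustrative.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'(?<!no\s)(?:(?:green\s)?frogs?\s)(?!not\s)(?:seen)?')

phrases = [
    'frogs seen',
    'green frog seen',
    'no frogs seen',
    'no green frogs seen',
    'frogs not seen',
    'green frogs not seen',
]

# Map each phrase to the text the pattern matched (None if there is no match).
results = {}
for phrase in phrases:
    m = pattern.search(phrase)
    results[phrase] = m.group(0) if m else None
    print(f'{phrase!r:25} -> {results[phrase]!r}')
```

For 'no green frogs seen' the recorded match is 'frogs seen': the attempt starting at 'green' is rejected by the look behind, but the search then retries at 'frogs', where the three preceding characters are 'en ' rather than 'no '.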

As far as I understand, negative look behinds can only be a fixed number of characters (i.e. it's not possible to use *, + or ? to allow variable string lengths). I thought that including (?:green) in the (?:frogs?) non-capture group would work to find that whole group and negate it if preceded by a fixed length negative-look-behind. However, this does not seem to be the case.
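The fixed-width restriction mentioned above can be checked directly. The sketch below assumes the standard-library `re` module (third-party regex engines may behave differently) and tries to compile a look behind that could be either 3 or 9 characters long; the variable names are illustrative.

```python
import re

# A look behind whose length can vary (here "no " or "no green ") is
# rejected by the standard-library `re` module when the pattern is compiled.
try:
    re.compile(r'(?<!no\s(?:green\s)?)frogs?')
    compiled = True
except re.error as exc:
    compiled = False
    print('re refused the pattern:', exc)
```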

Any suggestions how to fix this issue would be very much appreciated.

Upvotes: 1

Views: 125

Answers (2)

Booboo

Reputation: 44128

I believe your lookbehind fails because (?:green\s)? makes 'green ' optional. When the scanner arrives at 'frogs', the optional group matches nothing, so the lookbehind only checks the three characters immediately before 'frogs'. These are 'en ', not 'no ', so 'no green frogs seen' is accepted as a match. If you had used (?:green\s) instead, so that 'green ' was not optional, this test case would be rejected. So, instead of using a negative lookbehind, try a negative lookahead together with re.match, which only attempts a match at the start of the string:

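import re

test_cases = [
    'frogs seen',
    'green frog seen',
    'no frogs seen',
    'no green frogs seen',
    'frogs not seen',
    'green frogs not seen',
]

# (?!no\s+) rejects strings that begin with "no "; (?=\s+seen) requires
# "seen" directly after "frog"/"frogs", which rules out "frogs not seen".
regex = re.compile(r'(?!no\s+)(?:(?:green\s+)?frogs?)(?=\s+seen)')
for test_case in test_cases:
    # match() anchors at the start of the string, so the lookahead is
    # tested against the first word of each test case.
    if regex.match(test_case):
        print(test_case)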

Prints:

frogs seen
green frog seen
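If the phrases sit in the middle of longer text, as the question describes, one possible variation is to keep lookbehinds but stack two fixed-width ones, one for each possible prefix length. This is a different technique from the lookahead above, offered only as a sketch: it assumes exactly one whitespace character between the words, and the helper name `frogs_seen` is hypothetical.

```python
import re

# Two fixed-width negative lookbehinds, one per prefix length:
#   (?<!no\s)         rejects a match that starts right after "no "
#   (?<!no\sgreen\s)  rejects a match that starts at "frog" right after "no green "
# The lookahead requires "seen" directly after the noun, so
# "frogs not seen" is rejected as well.
pattern = re.compile(r'(?<!no\s)(?<!no\sgreen\s)(?:green\s)?frogs?(?=\s+seen)')


def frogs_seen(text):
    """Return True if the text contains a non-negated 'frogs seen' phrase."""
    return pattern.search(text) is not None


test_cases = [
    'frogs seen',
    'green frog seen',
    'no frogs seen',
    'no green frogs seen',
    'frogs not seen',
    'green frogs not seen',
    'Walked the pond at dusk; green frogs seen near the reeds.',
    'Heavy rain, no green frogs seen today.',
]

accepted = [t for t in test_cases if frogs_seen(t)]
for t in accepted:
    print(t)
```

Because `search` scans the whole string, the same function could be applied to each comment in a dataframe column, for example through the column's `apply` method.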

Upvotes: 1

Andrej Kesely

Reputation: 195438

I came up with this regex (regex101):

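import re

test_cases = [
    'frogs seen',
    'green frog seen',
    'no frogs seen',
    'no green frogs seen',
    'frogs not seen',
    'green frogs not seen',
]

# Each character is consumed only if neither "no ... frog(s)" nor
# "frog(s) ... not ... seen" starts at that position.
regex = re.compile(r'^((?!(?:(?:\bno\b.*frogs?)|(?:frogs?.*\bnot\b.*seen))).)*$')

for test_case in test_cases:
    # search() returns a match object or None, which reads more clearly
    # than findall() for a yes/no test.
    if regex.search(test_case):
        print('{} matches!'.format(test_case))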

Prints:

frogs seen matches!
green frog seen matches!
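To show how this pattern is put together, here is a sketch that lays out the same regex with `re.VERBOSE` and comments each part. The only change is that the outer group is non-capturing, which does not affect whether a string matches. The helper name `passes` and the extra 'toads seen' input are illustrative and not part of the original test cases.

```python
import re

# The same pattern, spread out with re.VERBOSE (whitespace and # comments
# inside the pattern are ignored by the regex engine).
pattern = re.compile(r'''
    ^
    (?:
        (?!                          # refuse to go on if, from this position,
            \bno\b.*frogs?           #   "no" ... "frog(s)" lies ahead, or
          | frogs?.*\bnot\b.*seen    #   "frog(s)" ... "not" ... "seen" lies ahead
        )
        .                            # otherwise consume one character
    )*
    $
''', re.VERBOSE)


def passes(text):
    """Return True if no negated frog phrase starts anywhere in the text."""
    return pattern.search(text) is not None


checks = {text: passes(text) for text in [
    'frogs seen',
    'green frog seen',
    'no frogs seen',
    'no green frogs seen',
    'frogs not seen',
    'green frogs not seen',
    'toads seen',
]}
for text, ok in checks.items():
    print(text, '->', ok)
```

Note that the pattern works by exclusion: it accepts any line that lacks the two negated phrases, so 'toads seen' also passes. If only comments that mention frogs should be kept, a separate check for 'frog' would be needed.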

Upvotes: 2
