Python Regex Negative Lookbehind Match without Fixed width

Question

I want to find a better way to get my result. I use a regex pattern to match all text of the form (DD+ some text DDDD some other text) if and only if it is not preceded by certain terms, which would need a non-fixed-width lookbehind. How can I include these terms in my regex pattern?

import re
import pandas as pd

aa = pd.DataFrame({"test": ["45 python 00222 sometext",
                            "python white 45 regex 00 222 somewhere",
                            "php noise 45 python 65000 sm",
                            "otherword 45 python 50000 sm"]})
# Raw string so that \d, \s and \w reach the regex engine unchanged
pattern = re.compile(r"(((\d+)\s?([^\W\d_]+)\s?)?(\d{2}\s?\d{3})\s?(\w.+))")
aa["result"] = aa["test"].apply(lambda x: pattern.search(x)[0] if pattern.search(x) else None)
lookbehind = ['python', 'php']
# Blank the result when a lookbehind term remains in the text outside the match
aa.apply(lambda x: "" if any(look in x["test"].replace(x["result"], "") for look in lookbehind) else x["result"], axis=1)

The output is what I expected

0    45 python 00222 sometext
1                            
2                            
3          45 python 50000 sm

Wiktor Stribiżew · Accepted Answer

You may use a workaround: capture php or python before the expected match. If that group matched (it is not empty), discard the current match; otherwise, the match is valid.
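For context, the workaround is needed because Python's built-in re module only accepts lookbehinds of a fixed width. The sketch below illustrates this; the helper name compiles is hypothetical, and the test patterns are made up for the demonstration rather than taken from the question.

```python
import re

def compiles(pattern_text):
    """Return True if the built-in re module accepts the pattern."""
    try:
        re.compile(pattern_text)
        return True
    except re.error:
        # Raised, among other cases, when a lookbehind is not fixed-width
        return False

print(compiles(r"(?<!php )\d+"))        # True: the lookbehind is exactly 4 chars wide
print(compiles(r"(?<!php.*)\d+"))       # False: .* makes the width variable
print(compiles(r"(?<!php|python)\d+"))  # False: the alternatives differ in width
```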

Use this pattern:

pattern = re.compile(r"(?:(php|python).*?)?((?:\d+\s?[^\W\d_]+\s?)?\d{2}\s?\d{3}\s?\w.+)")

The pattern contains 2 capturing groups:

(?:(php|python).*?)? - an optional non-capturing group (the trailing ? makes it optional) that matches php or python, capturing it into Group 1, and then 0+ chars other than line breaks, as few as possible
((?:\d+\s?[^\W\d_]+\s?)?\d{2}\s?\d{3}\s?\w.+) - this is Group 2, which is basically your pattern with the redundant groups removed.
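To see what each group holds, here is a small sketch that runs the pattern over the four sample strings from the question using only the re module. The helper groups_for is a hypothetical name introduced for illustration.

```python
import re

pattern = re.compile(
    r"(?:(php|python).*?)?"                           # Group 1: optional leading keyword
    r"((?:\d+\s?[^\W\d_]+\s?)?\d{2}\s?\d{3}\s?\w.+)"  # Group 2: the wanted text
)

samples = [
    "45 python 00222 sometext",
    "python white 45 regex 00 222 somewhere",
    "php noise 45 python 65000 sm",
    "otherword 45 python 50000 sm",
]

def groups_for(text):
    """Return (Group 1, Group 2) for the first match, or (None, None)."""
    m = pattern.search(text)
    return (m.group(1), m.group(2)) if m else (None, None)

for text in samples:
    print(groups_for(text))
# (None, '45 python 00222 sometext')
# ('python', '45 regex 00 222 somewhere')
# ('php', '45 python 65000 sm')
# (None, '45 python 50000 sm')
```

Group 1 is None only in the first and last rows, which are exactly the rows that should be kept.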

If Group 1 matched, return an empty result; otherwise, return the Group 2 value:

def callback(v):
    m = pattern.search(v)
    if m and not m.group(1):
        return m.group(2)
    return ""

aa["test"].apply(callback)

Result:

0    45 python 00222 sometext
1                            
2                            
3          45 python 50000 sm
