J. Doe

Reputation: 3634

Python Regex Negative Lookbehind Match without Fixed width

I am looking for a better way to get my result. I use a regex pattern to match text of the form (DD+ some text DDDD some other text), but only if it is not preceded by certain terms. These terms have different lengths, so they cannot go in a fixed-width lookbehind. How can I include them in my regex pattern?

import re
import pandas as pd

aa = pd.DataFrame({"test": ["45 python 00222 sometext",
                            "python white 45 regex 00 222 somewhere",
                            "php noise 45 python 65000 sm",
                            "otherword 45 python 50000 sm"]})
# Raw string so that \d, \s and \W reach the regex engine unchanged
pattern = re.compile(r"(((\d+)\s?([^\W\d_]+)\s?)?(\d{2}\s?\d{3})\s?(\w.+))")
aa["result"] = aa["test"].apply(lambda x: pattern.search(x)[0] if pattern.search(x) else None)
lookbehind = ['python', 'php']
# Blank the result when a forbidden term appears outside the matched text
aa.apply(lambda x: "" if any(look in x["test"].replace(x["result"], "") for look in lookbehind) else x["result"], axis=1)

The output is what I expect:

0    45 python 00222 sometext
1                            
2                            
3          45 python 50000 sm

Upvotes: 2

Views: 1677

Answers (2)

Valdi_Bo

Reputation: 30991

Because a negative lookbehind in Python's re module must have a fixed width, use a negative lookahead instead, anchored to the start of the string, that checks the part before the first digit.

The lookahead should include:

  • A sequence of non-digits (possibly empty).
  • Either of your "forbidden" strings.

This way, if the string to check contains python or php before the first digit, this lookahead will fail, preventing this string from further processing.
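A minimal sketch of both points, using only the standard library: first that a variable-width lookbehind is rejected at compile time, then the anchored lookahead used on its own as a guard. The names `guard`, `samples` and `passes` are just for illustration; the sample strings are the ones from the question.

```python
import re

# Python's re module rejects variable-width lookbehinds at compile time.
try:
    re.compile(r"(?<!python.*)\d+")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# The anchored negative lookahead on its own, used as a guard.
guard = re.compile(r"^(?!\D*(?:python|php))")

samples = [
    "45 python 00222 sometext",
    "python white 45 regex 00 222 somewhere",
    "php noise 45 python 65000 sm",
    "otherword 45 python 50000 sm",
]

# True where no forbidden word starts before the first digit.
passes = [guard.search(s) is not None for s in samples]
print(lookbehind_ok, passes)
```

Only the first and last strings pass the guard, since "python" there appears after the first digit.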

Because of the ^ anchor, the rest of the regex must first match a sequence of non-digits (whatever precedes the "DD+" part), followed by your original regex.

So the regex to use is as follows:

^(?!\D*(?:python|php))\D*(\d+)\s?([^\W\d_]+)\s?(\d{2}\s?\d{3})\s?(\w+)

Details:

  • ^(?! - Start of string and negative lookahead for:
    • \D* - A sequence of non-digits (may be empty).
    • (?:python|php) - Either of the "forbidden" strings, as a non-capturing group (no need to capture it).
  • ) - End of negative lookahead.
  • \D* - A sequence of non-digits (before what you want to match).
  • (\d+)\s? - The first sequence of digits + optional space.
  • ([^\W\d_]+)\s? - Some text No 1 + optional space.
  • (\d{2}\s?\d{3})\s? - The second sequence of digits (with optional space in the middle) + optional space.
  • (\w+) - Some text No 2.

The advantage of my solution over the other one is that you do not need to check whether the first group matched. Here you get only "positive" cases, which require no further check.

For a working example see https://regex101.com/r/gl9nWx/1
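As a rough sketch of how this regex could be applied in Python (the helper name `extract` is hypothetical): the overall match includes the leading `\D*` prefix, so the helper slices from the start of group 1 to get the same text the question expects. Note also that `(\w+)` stops at the end of the first word, whereas the question's `\w.+` runs to the end of the string; for the sample data the two give the same result.

```python
import re

pattern = re.compile(
    r"^(?!\D*(?:python|php))\D*(\d+)\s?([^\W\d_]+)\s?(\d{2}\s?\d{3})\s?(\w+)"
)

def extract(text):
    """Return the matched text from the first digit onward, or "" if rejected."""
    m = pattern.search(text)
    # m.start(1) skips the leading \D* prefix, which is part of the overall match.
    return text[m.start(1):m.end()] if m else ""

samples = [
    "45 python 00222 sometext",
    "python white 45 regex 00 222 somewhere",
    "php noise 45 python 65000 sm",
    "otherword 45 python 50000 sm",
]

results = [extract(s) for s in samples]
for r in results:
    print(repr(r))
```

With the question's DataFrame, the same helper could be passed to `aa["test"].apply(extract)`.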

Upvotes: 1

Wiktor Stribiżew

Reputation: 626926

You may use a hack: optionally capture php or python before the expected match. If that group matched, discard the current match; otherwise, the match is valid.

See

pattern = re.compile(r"(?:(php|python).*?)?((?:\d+\s?[^\W\d_]+\s?)?\d{2}\s?\d{3}\s?\w.+)")

The pattern contains 2 capturing groups:

  • (?:(php|python).*?)? - an optional non-capturing group (made optional by the last ?) that captures php or python into Group 1 and then matches 0+ chars, as few as possible
  • ((?:\d+\s?[^\W\d_]+\s?)?\d{2}\s?\d{3}\s?\w.+) - this is Group 2, which is basically your pattern with no redundant groups.
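To see what the two groups hold, here is a small sketch that runs the pattern on the sample strings from the question (the names `samples` and `groups` are just for illustration). Group 1 is `None` when no forbidden word precedes the match.

```python
import re

pattern = re.compile(
    r"(?:(php|python).*?)?((?:\d+\s?[^\W\d_]+\s?)?\d{2}\s?\d{3}\s?\w.+)"
)

samples = [
    "45 python 00222 sometext",
    "python white 45 regex 00 222 somewhere",
    "php noise 45 python 65000 sm",
    "otherword 45 python 50000 sm",
]

# Each entry is (Group 1, Group 2) for one sample string.
groups = [pattern.search(s).groups() for s in samples]
for g in groups:
    print(g)
```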

If Group 1 matches, we need to return an empty result, else, Group 2 value:

def callback(v):
    m = pattern.search(v)
    if m and not m.group(1):
        return m.group(2)
    return ""

aa["test"].apply(callback)

Result:

0    45 python 00222 sometext
1                            
2                            
3          45 python 50000 sm

Upvotes: 1
