Reputation: 3634
I want to find a better way to get my result. I use a regex pattern
to match all text of the form (DD+ some text DDDD some other text)
if and only if it is not preceded by certain terms, which would require a non-fixed-width lookbehind. How can I include these terms inside my regex pattern?
import re
import pandas as pd

aa = pd.DataFrame({"test": ["45 python 00222 sometext",
                            "python white 45 regex 00 222 somewhere",
                            "php noise 45 python 65000 sm",
                            "otherword 45 python 50000 sm"]})
# Raw string, so the regex backslashes are not treated as string escapes
pattern = re.compile(r"(((\d+)\s?([^\W\d_]+)\s?)?(\d{2}\s?\d{3})\s?(\w.+))")
aa["result"] = aa["test"].apply(lambda x: pattern.search(x)[0] if pattern.search(x) else None)
# Terms that must not appear before the match
lookbehind = ['python', 'php']
# Blank the result when a forbidden term remains after removing the match from the text
aa.apply(lambda x: "" if any(look in x["test"].replace(x["result"], "") for look in lookbehind) else x["result"], axis=1)
The output is what I expected
0 45 python 00222 sometext
1
2
3 45 python 50000 sm
Upvotes: 2
Views: 1677
Reputation: 30991
As negative lookbehind must be of fixed length, you have to use negative lookahead, anchored to the start of string, checking the part before the first digit.
It should include a sequence of non-digits (\D*) followed by either of the forbidden strings ((?:python|php)).
This way, if the string to check contains python or php before the first digit, this lookahead will fail, preventing this string from further processing.
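As a small illustration of the limitation mentioned above, the sketch below checks which patterns Python's built-in re module accepts. The helper name compiles and the three sample patterns are made up for this illustration, and the behaviour shown is that of the standard re module only.

```python
import re

def compiles(pattern_text):
    """Return True if `re` accepts the pattern, False if it raises re.error."""
    try:
        re.compile(pattern_text)
        return True
    except re.error:
        return False

# Fixed-width negative lookbehind: accepted by `re`.
fixed_ok = compiles(r"(?<!php )\d+")

# Variable-width negative lookbehind (".*" can match any length): rejected.
variable_ok = compiles(r"(?<!php.*)\d+")

# The workaround: a negative lookahead anchored at the start of the string.
lookahead_ok = compiles(r"^(?!\D*(?:python|php))\D*\d+")

print(fixed_ok, variable_ok, lookahead_ok)
```

The lookahead has no width restriction, which is why anchoring it at the start of the string can stand in for the lookbehind here.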
Because of the ^ anchor, the rest of the regex must first match a sequence of non-digits (what is before the "DD+" part), and then there should be your regex.
So the regex to use is as follows:
^(?!\D*(?:python|php))\D*(\d+)\s?([^\W\d_]+)\s?(\d{2}\s?\d{3})\s?(\w+)
Details:
^(?! - Start of string and negative lookahead for:
\D* - A sequence of non-digits (may be empty).
(?:python|php) - Either of the "forbidden" strings, as a non-capturing group (no need to capture it).
) - End of negative lookahead.
\D* - A sequence of non-digits (before what you want to match).
(\d+)\s? - The first sequence of digits + optional space.
([^\W\d_]+)\s? - Some text No 1 + optional space.
(\d{2}\s?\d{3})\s? - The second sequence of digits (with optional space in the middle) + optional space.
(\w+) - Some text No 2.

The advantage of my solution over the other is that you are free from checking whether the first group matched. Here you get only "positive" cases, which do not require any check.
For a working example see https://regex101.com/r/gl9nWx/1
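A minimal sketch of applying this regex to the sample strings from the question, using a plain list instead of the DataFrame to keep it self-contained. Note that the whole match also includes the leading non-digits consumed by the second \D* (e.g. "otherword "), so this sketch wraps the wanted part in one extra outer group. That outer group and the helper name extract are additions for illustration, not part of the regex above.

```python
import re

samples = [
    "45 python 00222 sometext",
    "python white 45 regex 00 222 somewhere",
    "php noise 45 python 65000 sm",
    "otherword 45 python 50000 sm",
]

# The regex from above, with an added outer group (group 1) around the part
# to extract, so that leading non-digits are left out of the result.
pattern = re.compile(
    r"^(?!\D*(?:python|php))\D*"
    r"((\d+)\s?([^\W\d_]+)\s?(\d{2}\s?\d{3})\s?(\w+))"
)

def extract(text):
    """Return the wanted part, or "" when a forbidden term precedes the first digit."""
    m = pattern.search(text)
    return m.group(1) if m else ""

results = [extract(s) for s in samples]
for text, res in zip(samples, results):
    print(repr(text), "->", repr(res))
```

With a DataFrame, the same function could be passed to aa["test"].apply(extract). Because the final group is (\w+), only a single trailing word is captured, unlike the \w.+ in the question's pattern.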
Upvotes: 1
Reputation: 626926
You may use a hack that consists in capturing php or python before the expected match: if the group is not empty (i.e. it matched), discard the current match; otherwise, the match is valid.
The pattern is:
pattern = re.compile(r"(?:(php|python).*?)?((?:\d+\s?[^\W\d_]+\s?)?\d{2}\s?\d{3}\s?\w.+)")
The pattern contains 2 capturing groups:
(?:(php|python).*?)? - the last ? makes this group optional; it matches and captures php or python into Group 1, and then 0+ chars, as few as possible
((?:\d+\s?[^\W\d_]+\s?)?\d{2}\s?\d{3}\s?\w.+) - this is Group 2, which is basically your pattern with no redundant groups.

If Group 1 matches, we need to return an empty result, else, Group 2 value:
def callback(v):
    m = pattern.search(v)
    # Keep the match only when Group 1 (php/python) did not participate
    if m and not m.group(1):
        return m.group(2)
    return ""

aa["test"].apply(callback)
Result:
0 45 python 00222 sometext
1
2
3 45 python 50000 sm
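To make the role of the two groups more visible, here is a small sketch that records both groups for each sample string from the question. It uses a plain list rather than the DataFrame, and the variable names samples and groups are chosen for this illustration only.

```python
import re

pattern = re.compile(
    r"(?:(php|python).*?)?((?:\d+\s?[^\W\d_]+\s?)?\d{2}\s?\d{3}\s?\w.+)"
)

samples = [
    "45 python 00222 sometext",
    "python white 45 regex 00 222 somewhere",
    "php noise 45 python 65000 sm",
    "otherword 45 python 50000 sm",
]

# Collect (Group 1, Group 2) for every string; Group 1 is None when no
# forbidden term was found before the match.
groups = []
for text in samples:
    m = pattern.search(text)
    groups.append((m.group(1), m.group(2)) if m else (None, None))

for text, (g1, g2) in zip(samples, groups):
    print(f"{text!r}: group1={g1!r}, group2={g2!r}")
```

Rows where Group 1 is set are exactly the ones the callback turns into an empty string.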
Upvotes: 1