Reputation: 10454
I want to detect whether a word is in a sentence using a Python regex. I also want to be able to negate the test.
import re
re.match(r'(?=.*\bfoo\b)', 'bar red foo here')
This code works, but I do not understand why I need to put .* in there.
I also do not know how to negate it.
I've tried:
re.match(r'(?!=.*\bfoo\b)', 'bar red foo here')
but it does not work. My ultimate goal is to combine them like so:
re.match(r'(?=.*\bfoo\b)(?!=.*\bbar\b)', 'bar red foo here')
Upvotes: 3
Views: 5126
Reputation:
Update
I just found that Python's re.match() has an implied ^ anchor.
In other words, it will only match at the beginning of the string,
and strangely, unlike Java, it does not require the match to cover the entire string.
Be warned, though, that combining a sequential positive and negative lookahead,
as in Stribnez's answer, can give unintended results if the pair is not anchored to
something: either to literal text or to a beginning-of-string (BOS) anchor ^.
For general usage, don't rely on whether a language's match() function
implies a BOS anchor ^ (and possibly an end-of-string (EOS) anchor $).
Put one (or both) in the pattern at all times. That way the pattern can be used
with search() as well, and it is portable to other languages.
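As a minimal sketch of that advice, the snippet below puts an explicit ^ in front of the combined pattern from the question and checks that re.match() and re.search() then agree. The sample sentences and the names PATTERN, samples and results are only illustrative.

```python
import re

# Combined pattern from the question, with an explicit ^ added in front.
PATTERN = re.compile(r'^(?=.*\bfoo\b)(?!.*\bbar\b)')

# Illustrative inputs: foo only, foo and bar, neither.
samples = ['red foo here', 'bar red foo here', 'red here']

# With the explicit anchor, match() and search() give the same verdict.
results = {s: (bool(PATTERN.match(s)), bool(PATTERN.search(s))) for s in samples}

for sentence, (via_match, via_search) in results.items():
    print(f'{sentence!r}: match={via_match}, search={via_search}')
```

Only the first sentence should pass, and it should pass with both functions.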
To see how in-series negative and positive lookaheads can cause problems,
take this tricky standalone expression: (?=.*\bfoo\b)(?!.*\bbar\b)
It can be examined like this:
Since they are in series, both assertions have to be satisfied at the same
position in the string.
At any given position, the negative assertion is satisfied
as long as nothing downstream of that position matches its contents.
Assuming that no anchoring exists, this leaves an opening upstream
of the search position (where the bar literal sits in the example) for
the undesired content to exist while the positive/negative
assertion pair is still satisfied.
Example:
(?=.*\bfoo\b)(?!.*\bbar\b)
matches
bar red foo
** Grp 0 - ( pos 1 , len 0 ) EMPTY
b<here>ar red foo
This shows that at position 1, both assertions are satisfied.
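The same result can be reproduced in Python with re.search(), which performs the bump-along that re.match() does not. This is a small sketch; the variable names are illustrative.

```python
import re

unanchored = r'(?=.*\bfoo\b)(?!.*\bbar\b)'
anchored = '^' + unanchored
text = 'bar red foo'

# search() bumps along until both assertions hold at the same position.
m = re.search(unanchored, text)
print(m.start(), repr(m.group()))   # position of the empty match

# With an explicit ^ only position 0 is tried, where the negative lookahead fails.
anchored_result = re.search(anchored, text)
print(anchored_result)
```

The unanchored search reports an empty match one character into bar, while the anchored search finds nothing.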
Conclusion(s):
1. Always use anchors, even if they are implied.
2. Avoid using any language's match() function; use search() instead.
End update
It doesn't matter if you use a positive or negative lookahead,
if you don't use the correct syntax, it won't work.
Look at this: (?!=.*\bfoo\b)
This says that the next character can't be an equal sign = followed by
a greedy run of characters up to the next foo. That sequence is what is disallowed.
So, it will not match at the start of '= ab foo', but it will match at '=(here) ab foo'.
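A short sketch of what the mistyped pattern actually does in Python. The variable names are illustrative; the strings are the ones used in this thread.

```python
import re

# The mistyped pattern: a negative lookahead whose body starts with a literal '='.
typo = r'(?!=.*\bfoo\b)'

# Next character is 'b', not '=', so the assertion holds even though foo is present.
at_start_plain = re.match(typo, 'bar red foo here')

# Here the text at position 0 really is '=' followed by ... foo, so match() fails.
at_start_equals = re.match(typo, '= ab foo')

# search() bumps along one character and succeeds just after the '='.
bumped = re.search(typo, '= ab foo')

print(at_start_plain is not None, at_start_equals, bumped.start())
```

This is why the attempt in the question appears to "not work": it succeeds on the sample sentence instead of rejecting it.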
The next problem is that if you don't give the assertion anything to anchor on,
the engine will use a bump-along to move the position to a place between characters
that satisfies it.
The correction for the negative lookahead you are looking for is this:
^(?!.*\bfoo\b)
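A quick check of the anchored negative lookahead, as a sketch. The name NO_FOO and the third sample (food, which is not the whole word foo) are illustrative additions.

```python
import re

# Anchored negative lookahead: succeeds only if no whole word foo follows.
NO_FOO = re.compile(r'^(?!.*\bfoo\b)')

checks = {
    s: bool(NO_FOO.search(s))
    for s in ['bar red foo here', 'bar red here', 'bar red food here']
}

for sentence, ok in checks.items():
    print(f'{sentence!r}: no whole-word foo = {ok}')
```

Because of the explicit ^, the same pattern behaves identically with re.match() and re.search().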
For reference:
(?=..) Positive lookahead
(?<=..) Positive lookbehind
(?!..) Negative lookahead
(?<!..) Negative lookbehind
And, they can be mixed and nested anywhere.
Upvotes: 2
Reputation: 20336
You need the .* because re.match() tries to match the pattern at the beginning of the string. If you want to search the whole string, use re.search().
Just as you can do if re.search(...):, you can also do if not re.search(...):
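A minimal sketch of that approach, with no lookaheads at all. The helper name describe and the sample sentences are made up for illustration.

```python
import re

# Plain word-boundary pattern; search() scans the whole string, so no .* is needed.
WORD_FOO = re.compile(r'\bfoo\b')

def describe(sentence):
    """Report whether the whole word foo occurs in the sentence."""
    if WORD_FOO.search(sentence):
        return 'has foo'
    if not WORD_FOO.search(sentence):
        return 'no foo'

print(describe('bar red foo here'))
print(describe('bar red here'))
```

The second branch is written out as if not only to mirror the negated form; in real code an else would do.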
Upvotes: 1
Reputation: 626893
To detect if a word exists in a string you need a positive lookahead:
(?=.*\bfoo\b)
The .* is necessary to enable searching farther than just at the string start (re.match anchors the search at the string start).
To check if a string has no word in it, use a negative lookahead:
(?!.*\bbar\b)
^^^
So, combining them:
re.match(r'(?=.*\bfoo\b)(?!.*\bbar\b)', input)
will find a match in a string that contains a whole word foo and does not contain a whole word bar.
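If more than one word needs to be required or excluded, the same idea can be generalized by chaining one lookahead per word. The helper below is a hypothetical sketch (build_filter is not part of the re module), and it adds an explicit ^ so the pattern also works with re.search().

```python
import re

def build_filter(required, forbidden):
    """Compile a pattern that needs every word in required and none in forbidden.

    Hypothetical helper: one positive lookahead per required word and one
    negative lookahead per forbidden word, all anchored at the string start.
    """
    parts = ['^']
    parts += [rf'(?=.*\b{re.escape(w)}\b)' for w in required]
    parts += [rf'(?!.*\b{re.escape(w)}\b)' for w in forbidden]
    return re.compile(''.join(parts))

foo_not_bar = build_filter(['foo'], ['bar'])
print(foo_not_bar.pattern)

for sentence in ['red foo here', 'bar red foo here', 'red here']:
    print(f'{sentence!r}: {bool(foo_not_bar.search(sentence))}')
```

For the single pair foo and bar this builds the combined pattern above with a leading ^.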
Upvotes: 4