Reputation: 1605
I'm trying to create a regular expression that finds a keyword within one long string, but only when the keyword is not surrounded by letters. A keyword next to a dash or underscore should still match, as long as it is not adjacent to a letter. I only need to find one occurrence of any keyword to count the string as a match. Currently, I can't get it to return True when the keyword has a '_' next to it. Any ideas for a better expression?
Edit: I found a case that needs to be True and that I had not included in the example.
import re
key_words = ['go', 'at', 'why', 'stop']
false_match = ['going_get_that', 'that_is_wstop', 'whysper', 'stoping_tat']
positive_match = ['go-around', 'go_at_going', 'stop-by_the_store', 'stop', 'something-stop', 'something_stop']
pattern = r"\b(%s)\b" % '|'.join(key_words)
for word in false_match + positive_match:
    if re.match(pattern, word):
        print True, word
    else:
        print False, word
Current Output:
False going_get_that
False that_is_wstop
False whysper
False stoping_tat
True go-around
False go_at_going
True stop-by_the_store
True stop
Edit: these two need to be True:
False something-stop
False something_stop
Desired output:
False going_get_that
False that_is_wstop
False whysper
False stoping_tat
True go-around
True go_at_going
True stop-by_the_store
True stop
True something-stop
True something_stop
Upvotes: 2
Views: 60
Reputation: 107307
Your pattern is close. You can use the following pattern:
pattern = r"\b([a-zA-Z]+[-_])?(%s)([-_][a-zA-Z_-]+)?\b" % '|'.join(key_words)
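A quick way to check this pattern is to run it over the question's lists. The sketch below (Python 3) suggests it handles every listed case with `re.match`, but it allows only one letters-only segment before the keyword. The string `'that_is_stop'` is a hypothetical extra case, not one from the question, added to show that limit.

```python
import re

key_words = ['go', 'at', 'why', 'stop']
pattern = r"\b([a-zA-Z]+[-_])?(%s)([-_][a-zA-Z_-]+)?\b" % '|'.join(key_words)

# Test cases copied from the question
false_match = ['going_get_that', 'that_is_wstop', 'whysper', 'stoping_tat']
positive_match = ['go-around', 'go_at_going', 'stop-by_the_store', 'stop',
                  'something-stop', 'something_stop']

# True when every listed case gives the desired result
listed_ok = (all(re.match(pattern, w) for w in positive_match)
             and not any(re.match(pattern, w) for w in false_match))

# Hypothetical extra case: the keyword sits in the third segment,
# so the single optional prefix group cannot reach it
extra = re.search(pattern, 'that_is_stop')

print(listed_ok)  # True
print(extra)      # None
```

If keywords can appear after more than one separator, a lookaround-based pattern avoids having to describe the text around the keyword.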
Upvotes: 0
Reputation: 27283
Use negative look(ahead|behind)s:
import re
key_words = ['go', 'at', 'why', 'stop']
false_match = ['going_get_that', 'that_is_wstop', 'whysper', 'stoping_tat']
positive_match = ['go-around', 'go_at_going', 'stop-by_the_store', 'stop', 'something-stop', 'something_stop']
pattern = r"(?<![a-zA-Z])(%s)(?![a-zA-Z])" % '|'.join(key_words)
for word in false_match + positive_match:
    if re.search(pattern, word):
        print True, word
    else:
        print False, word
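Note that this uses `re.search` rather than `re.match`: `re.match` only tries the pattern at the start of the string, so it can never find a keyword that appears later, as in `'something-stop'`. As a minimal Python 3 sketch of the same idea, the helper names `build_pattern` and `contains_key_word` below are made up for illustration, and `re.escape` is an added precaution in case a keyword ever contains regex metacharacters.

```python
import re

def build_pattern(key_words):
    """Compile a pattern matching any keyword not adjacent to a letter."""
    # re.escape is a precaution; the keywords in the question are plain letters
    alternation = "|".join(re.escape(word) for word in key_words)
    return re.compile(r"(?<![a-zA-Z])(?:%s)(?![a-zA-Z])" % alternation)

def contains_key_word(pattern, text):
    """Return True if at least one keyword occurs anywhere in text."""
    return pattern.search(text) is not None

key_words = ['go', 'at', 'why', 'stop']
false_match = ['going_get_that', 'that_is_wstop', 'whysper', 'stoping_tat']
positive_match = ['go-around', 'go_at_going', 'stop-by_the_store', 'stop',
                  'something-stop', 'something_stop']

pattern = build_pattern(key_words)
results = {word: contains_key_word(pattern, word)
           for word in false_match + positive_match}

for word, found in results.items():
    print(found, word)
```

Compiling the pattern once keeps the loop cheap if many strings are checked against the same keyword list.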
Upvotes: 1
Reputation: 67968
The problem with \b is that it treats _ as a word character (part of \w), so there is no word boundary between a letter and _. To work around that, define your own character class:
(?:^|(?<=[^a-zA-Z0-9]))(go|at|why|stop)(?=[^a-zA-Z0-9]|$)
Try this. See the demo:
https://regex101.com/r/oC3qA3/8
import re
key_words = ['go', 'at', 'why', 'stop']
false_match = ['going_get_that', 'that_is_wstop', 'whysper', 'stoping_tat']
positive_match = ['go-around', 'go_at_going', 'stop-by_the_store', 'stop']
pattern = r"(?:^|(?<=[^a-zA-Z0-9]))(%s)(?=[^a-zA-Z0-9]|$)" % '|'.join(key_words)
for word in false_match + positive_match:
    if re.findall(pattern, word):
        print True, word
    else:
        print False, word
Output:
False going_get_that
False that_is_wstop
False whysper
False stoping_tat
True go-around
True go_at_going
True stop-by_the_store
True stop
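To see the \b behaviour in isolation, here is a small Python 3 sketch. The sample strings `'go_at'` and `'go-at'` are chosen only for illustration, and the custom-class pattern is the one above with a single keyword substituted in.

```python
import re

# "_" belongs to \w, so it counts as a word character
underscore_is_word_char = re.fullmatch(r"\w", "_") is not None

# No boundary between "o" and "_", so \bgo\b finds nothing here
with_underscore = re.search(r"\bgo\b", "go_at")

# "-" is not a word character, so the boundary exists and "go" matches
with_dash = re.search(r"\bgo\b", "go-at")

# A custom class that excludes "_" from the word characters fixes the first case
custom = re.search(r"(?:^|(?<=[^a-zA-Z0-9]))go(?=[^a-zA-Z0-9]|$)", "go_at")

print(underscore_is_word_char)  # True
print(with_underscore)          # None
print(with_dash is not None)    # True
print(custom is not None)       # True
```

Because this class also lists 0-9, a keyword directly followed by a digit is not treated as a match, which is stricter than the letters-only rule stated in the question.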
Upvotes: 0