Reputation: 1422
I have a list of words:
wordlist = ['hypothesis' , 'test' , 'results' , 'total']
and a sentence:
sentence = "These tests will benefit in the long run."
I want to check whether the words in wordlist appear in the sentence. I know I can check whether they are substrings of the sentence using:
for word in wordlist:
    if word in sentence:
        print(word)
However, substring matching also reports words that are not really in the sentence. Here, for example, test matches as a substring even though the sentence only contains tests. I could solve this with regular expressions, but is it possible to build the pattern from each word in turn? That is, if I want to check whether each word is in the sentence:
for some_word_goes_in_here in wordlist:
    if re.search('.*(some_word_goes_in_here).*', sentence):
        print(some_word_goes_in_here)
In this case the regular expression treats the literal text some_word_goes_in_here as the pattern to search for, not the value of the variable some_word_goes_in_here. Is there a way to format the pattern so that the regular expression searches for the value of some_word_goes_in_here?
Upvotes: 1
Views: 355
Reputation: 1121484
Use \b word boundaries to test for whole words:
import re

for word in wordlist:
    if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
        print('{} matched'.format(word))
but you could also just split the sentence into separate words. Using a set for the word list would make the test more efficient:
words = set(wordlist)
# no looping over `words` required.
matched = words.intersection(sentence.split())
if matched:
    print(matched)
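One caveat with the split() approach: str.split() only splits on whitespace, so punctuation stays attached to the neighbouring word. A minimal sketch of the difference, assuming a made-up sentence (not the one from the question) and using re.findall(r'\w+', ...) as one possible tokenizer:

```python
import re

wordlist = ['hypothesis', 'test', 'results', 'total']
words = set(wordlist)

# Hypothetical sentence, chosen so that punctuation touches the target words.
sentence = "We ran the test, then checked the total."

# Whitespace splitting yields 'test,' and 'total.', so nothing intersects.
naive = words.intersection(sentence.split())

# Extracting runs of word characters drops the punctuation first.
tokens = re.findall(r'\w+', sentence)
tokenized = words.intersection(tokens)

print(naive)
print(tokenized)
```

Whether \w+ is the right tokenizer depends on your text; hyphenated words and apostrophes, for example, would be split into separate tokens.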
Demo:
>>> import re
>>> wordlist = ['hypothesis' , 'test' , 'results' , 'total']
>>> sentence = "These tests will benefit in the long run."
>>> for word in wordlist:
...     if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
...         print('{} matched'.format(word))
...
>>> words = set(wordlist)
>>> words.intersection(sentence.split())
set([])
>>> sentence = 'Lets test this hypothesis that the results total the outcome'
>>> for word in wordlist:
...     if re.search(r'\b{}\b'.format(re.escape(word)), sentence):
...         print('{} matched'.format(word))
...
hypothesis matched
test matched
results matched
total matched
>>> words.intersection(sentence.split())
set(['test', 'total', 'hypothesis', 'results'])
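As a side note on why re.escape() is in the pattern: without it, any regex metacharacter inside a word is interpreted as pattern syntax. Here is a small sketch; the helper name contains_word and the sample strings are only illustrative assumptions, not part of the question:

```python
import re

def contains_word(word, sentence):
    """Return True if `word` occurs in `sentence` as a whole word.

    Hypothetical helper wrapping the \\b + re.escape() pattern.
    """
    pattern = r'\b{}\b'.format(re.escape(word))
    return re.search(pattern, sentence) is not None

sentence = "Version 345 shipped"

# Unescaped, the '.' in '3.5' matches any character, so '345' is a hit.
unescaped = re.search(r'\b3.5\b', sentence) is not None

# Escaped, the '.' only matches a literal dot, so there is no hit.
escaped = contains_word('3.5', sentence)

print(unescaped, escaped)
```

For plain alphabetic words like the ones in the question, escaping makes no difference, but it is a cheap safeguard if the word list comes from user input.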
Upvotes: 2
Reputation: 71538
Try using:
if re.search(r'\b' + word + r'\b', sentence):
\b is a word boundary: it matches at the position between a word character and a non-word character, or at the start or end of the string next to a word character (a word character is any letter, digit or underscore).
For instance,
>>> import re
>>> wordlist = ['hypothesis' , 'test' , 'results' , 'total']
>>> sentence = "The total results for the test confirm the hypothesis"
>>> for word in wordlist:
...     if re.search(r'\b' + word + r'\b', sentence):
...         print(word)
...
hypothesis
test
results
total
With your string:
>>> sentence = "These tests will benefit in the long run."
>>> for word in wordlist:
...     if re.search(r'\b' + word + r'\b', sentence):
...         print(word)
...
>>>
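Nothing is printed, because \btest\b does not match inside tests: the s that follows is a word character, so there is no boundary after test.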
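To make the boundary rule a bit more concrete, here is a small sketch that runs \btest\b against a few sample strings. The samples are my own, picked only to show where a boundary does and does not exist:

```python
import re

pattern = re.compile(r'\btest\b')

# Illustrative samples (assumptions, not taken from the question).
samples = ['A test.', 'test-driven', 'These tests', 'contest', 'test_case']

# Map each sample to whether the whole-word pattern is found in it.
results = {s: bool(pattern.search(s)) for s in samples}

for sample in samples:
    print(sample, results[sample])
```

The first two match because '.' and '-' are non-word characters. The last three do not, because 's', 'n' and '_' are all word characters, so no boundary sits next to test.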
Upvotes: 1
Reputation: 59426
I'd use this:
import re

words = "hypothesis test results total".split()
# ^^^ but you can use your literal list if you prefer that
for word in words:
    if re.search(r'\b%s\b' % (word,), sentence):
        print(word)
You can even speed this up by using a single regexp:
for foundWord in re.findall(r'\b' + r'\b|\b'.join(words) + r'\b', sentence):
    print(foundWord)
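A possible variation on the single-regexp idea, as a sketch rather than a drop-in replacement: compile the alternation once, escape each word, and wrap the alternatives in a non-capturing group so that \b only has to be written once on each side. The second sentence is borrowed from another answer in this thread:

```python
import re

words = ['hypothesis', 'test', 'results', 'total']

# Build \b(?:hypothesis|test|results|total)\b, escaping each word in case
# it contains regex metacharacters.
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b')

sentence = 'Lets test this hypothesis that the results total the outcome'
found = pattern.findall(sentence)

# The sentence from the question contains 'tests', not 'test'.
not_found = pattern.findall('These tests will benefit in the long run.')

print(found)
print(not_found)
```

findall() returns the matches in the order they appear in the sentence, including repeats, so wrap the result in set() if you only care about which words occurred.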
Upvotes: 1