Reputation: 193
I have a set of words:
words = {'thanks giving', 'cat', 'instead of',etc...}
I need to search for exactly these words in the 'description' column of a table:
----|--------------------------|
ID  | Description              |
----|--------------------------|
1   | having fun thanks giving |
----|--------------------------|
2   | cat eats all the food    |
----|--------------------------|
3   | instead you can come     |
----|--------------------------|
def matched_words(x, words):
    # Collect every search term that occurs in x as a plain substring
    match_words = []
    for word in words:
        if word in x:
            match_words.append(word)
    return match_words

df['new_col'] = df['description'].apply(lambda x: matched_words(x, words))
Desired output:
----|--------------------------|-------------------|
ID  | Description              | matched words     |
----|--------------------------|-------------------|
1   | having fun thanks giving | ['thanks giving'] |
----|--------------------------|-------------------|
2   | cat eats all the food    | ['cat']           |
----|--------------------------|-------------------|
3   | instead you can come     | []                |
----|--------------------------|-------------------|
I'm only getting matches for single tokens, like ['cat'].
Upvotes: 0
Views: 7112
Reputation: 1714
The following code should give you the results you're looking for:
import re

words = {'thanks', 'cat', 'instead of'}
phrases = [
    [1, "having fun at thanksgiving"],
    [2, "cater the food"],
    [3, "instead you can come"],
    [4, "instead of pizza"],
    [5, "thanks for all the fish"]
]
matched_words = []
matched_pairs = []
for word in words:
    for phrase in phrases:
        # Raw strings keep \b and \W intact for the regex engine
        result = re.search(r'\b' + word + r'\W', phrase[1])
        if result:
            matched_words.append(result.group(0))
            matched_pairs.append([result.group(0), phrase])
print()
print(matched_words)
print(matched_pairs)
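The relevant part, that is, the regex bit re.search(r'\b' + word + r'\W', phrase[1]), searches for cases in which our search string begins at a word boundary (\b, which matches the empty string at the edge of a word) and is followed by a non-word character (\W). This ensures that we only find whole-word matches, so 'cat' does not match inside 'cater'. No need to do anything else to the text you want to search. One caveat: \W has to consume a character, so a term at the very end of the text is not matched, and the text in result.group(0) includes that trailing character. Ending the pattern with \b instead avoids both.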
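To make the boundary behaviour concrete, here is a minimal sketch comparing a trailing \W with a trailing \b. The sample strings and the names term, text_end, text_mid, with_nonword and with_boundary are made up for illustration, and re.escape is added as a precaution in case a search term ever contains regex metacharacters.

```python
import re

term = "thanks giving"
text_end = "having fun thanks giving"   # term is the last thing in the string
text_mid = "thanks giving, everyone"    # term is followed by a comma

# Trailing \W must consume one non-word character after the term.
with_nonword = re.compile(r"\b" + re.escape(term) + r"\W")
# Trailing \b is zero-width, so it also succeeds at the end of the string.
with_boundary = re.compile(r"\b" + re.escape(term) + r"\b")

print(with_nonword.search(text_end))            # None
print(with_nonword.search(text_mid).group(0))   # thanks giving,
print(with_boundary.search(text_end).group(0))  # thanks giving
print(with_boundary.search(text_mid).group(0))  # thanks giving
```

Note that \b only behaves this way when the term itself starts and ends with a word character (letter, digit or underscore).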
Of course, you can use any names you like instead of words, phrases, matched_words and matched_pairs.
Hope this helps!
Upvotes: 1
Reputation: 49784
import re

words = {'thanks', 'cat', 'instead of'}
samples = [
    (1, 'having fun at thanksgiving'),
    (2, 'cater the food'),
    (3, 'instead you can come'),
    (4, 'instead of you can come'),
]
for sample_id, description in samples:
    for word in words:
        # \b on both sides: the term must start and end at a word boundary
        if re.search(r'\b' + word + r'\b', description):
            print("'%s' in '%s'" % (word, description))
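If you want a list per row, as in the question's desired output, the same \b check can be wrapped in a function. This is only a sketch under a few assumptions: the terms start and end with word characters, matching is case-sensitive, and sorted() is used purely to make the order of the results predictable. The function name is reused from the question, and the descriptions list is illustrative data copied from the question's table.

```python
import re

def matched_words(description, words):
    """Return the terms from `words` found in `description` as whole words or phrases."""
    found = []
    for word in sorted(words):  # sorted only to keep the output order stable
        # re.escape guards against regex metacharacters inside a term
        pattern = r"\b" + re.escape(word) + r"\b"
        if re.search(pattern, description):
            found.append(word)
    return found

words = {"thanks giving", "cat", "instead of"}
descriptions = [
    "having fun thanks giving",
    "cat eats all the food",
    "instead you can come",
]
for description in descriptions:
    print(description, "->", matched_words(description, words))
```

With a DataFrame it should slot into the call from the question: df['new_col'] = df['description'].apply(lambda x: matched_words(x, words)).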
Upvotes: 0