Reputation: 68
I have keywords in a column of a dataframe (D1): 1-grams, 2-grams and, in some cases, 3-grams. I need to search for these n-grams in a column of another dataframe (D2) that contains phrases, and count the occurrences of each n-gram so that I can assign weights to them.
I tried nested loops, but that is too computationally expensive, and the results are disappointing: very short strings such as 'a' and 'in' also get matched.
import re
import pandas as pd
word_list = data['Words'].values.tolist()  # convert the keywords into a list
s = pd.Series({w: pos_phrases.Phrases.str.contains(w, flags=re.IGNORECASE).sum() for w in word_list})  # number of phrases containing w as a substring
The phrases are in the Phrases column of pos_phrases. Some of the keywords are:
'high-fidelity', 'hi-fi', 'surgical', 'straight', 'true', 'dead on target','wide of the mark', etc.
The phrases read like a conversation between two people, e.g.:
Sample Phrase: "Hello Good evening, how are you, so can you point out the facts which lead to this eventful night"
Keywords to match: "Good evening", "eventful", "event"
Here, "event" must not match, because it is only part of "eventful". However, it is being matched. I hope I have explained my requirement clearly.
Upvotes: 0
Views: 102
Reputation: 945
A clean, simple way to handle this is to use regular expressions with word boundaries (\b), as follows:
import re
Phrase = "Hello Good evening, how are you, so can you point out the facts which lead to this eventful night"
Words = "Good evening, eventful, event"
word_list = Words.split(', ')
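for word in word_list:
    # \b anchors the keyword at word boundaries, so "event" no longer matches inside "eventful";
    # re.escape keeps any regex metacharacters in a keyword literal
    pattern = r"\b" + re.escape(word) + r"\b"
    matches = re.findall(pattern, Phrase, re.IGNORECASE)
    print(word, ':', len(matches))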
Output:
## Good evening : 1
## eventful : 1
## event : 0
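To apply the same idea to the dataframes from the question, something along these lines should work. This is only a sketch: the two small dataframes are made-up stand-ins for `data` and `pos_phrases`, and `count_keyword` is a hypothetical helper name. It relies on pandas' `Series.str.count`, which accepts a regex and flags, so the per-phrase loop is handled by pandas rather than by nested Python loops.

```python
import re
import pandas as pd

# Hypothetical stand-ins for the question's two dataframes.
data = pd.DataFrame({"Words": ["Good evening", "eventful", "event", "hi-fi"]})
pos_phrases = pd.DataFrame({
    "Phrases": [
        "Hello Good evening, how are you, so can you point out the facts "
        "which lead to this eventful night",
        "Good evening again, that hi-fi sounds great",
    ]
})


def count_keyword(phrases: pd.Series, keyword: str) -> int:
    """Total whole-word occurrences of keyword across all phrases, ignoring case."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return int(phrases.str.count(pattern, flags=re.IGNORECASE).sum())


# One count per keyword, indexed by the keyword itself.
counts = pd.Series(
    {w: count_keyword(pos_phrases["Phrases"], w) for w in data["Words"]}
)
print(counts)
```

Note that this counts every occurrence, whereas `str.contains(...).sum()` in the question counts the number of phrases that contain the keyword at least once. If the latter is what you need, keep `str.contains` and pass it the same `\b`-wrapped pattern.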
Upvotes: 1