krits

Reputation: 68

How to count word or word_group occurrences in a string (phrase)

I have keywords in a column of a dataframe (D1); they are 1-grams, 2-grams and, in some cases, 3-grams. I need to search for these n-grams in a column of another dataframe (D2) that contains phrases, and count the occurrences of each n-gram so that I can assign weights to them.

I tried nested loops, but that is computationally too expensive. The results are also disappointing: very short keywords such as 'a' and 'in' are being matched as well.

import re
import pandas as pd
word_list = data['Words'].values.tolist()  # convert the keywords into a list
s = pd.Series({w: pos_phrases.Phrases.str.contains(w, flags=re.IGNORECASE).sum() for w in word_list})  # substring match per keyword

The phrases are in pos_phrases under Phrases. Some of the keywords are:

'high-fidelity', 'hi-fi', 'surgical', 'straight', 'true', 'dead on target','wide of the mark', etc.

The phrases read like a conversation between two people, e.g.:

Sample Phrase: "Hello Good evening, how are you, so can you point out the facts which lead to this eventful night"
Keywords to match: "Good evening", "eventful", "event"

Here, "event" must not match, because it only occurs as part of "eventful". However, it is being matched. I hope this explains my requirement.

Upvotes: 0

Views: 102

Answers (1)

SanV

Reputation: 945

A clean, simple way to handle this is to use regular expressions with word boundaries (\b), as follows:

import re

Phrase = "Hello Good evening, how are you, so can you point out the facts which lead to this eventful night"
Words = "Good evening, eventful, event"

word_list = Words.split(', ')

for word in word_list:
    pattern = r"\b" + re.escape(word) + r"\b"  # escape the keyword; \b anchors it to word boundaries
    matches = re.findall(pattern, Phrase, re.IGNORECASE)
    print(word, ': ', len(matches))

Output:  
## Good evening :  1
## eventful :  1
## event :  0  
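
To apply the same idea to many phrases at once, one option is to compile each keyword pattern once and reuse it across all phrases. The following is only a sketch: the function name count_keywords and the second sample phrase are made up for illustration, and it uses lookarounds ((?<!\w) and (?!\w)) instead of \b so that keywords that begin or end with punctuation are still bounded sensibly.

```python
import re


def count_keywords(phrases, keywords):
    """Count whole-word, case-insensitive occurrences of each keyword over all phrases.

    count_keywords is a hypothetical helper, not part of pandas or re.
    """
    # Compile one pattern per keyword up front so it is not rebuilt for every phrase.
    patterns = {
        kw: re.compile(r"(?<!\w)" + re.escape(kw) + r"(?!\w)", re.IGNORECASE)
        for kw in keywords
    }
    counts = {kw: 0 for kw in keywords}
    for phrase in phrases:
        for kw, pattern in patterns.items():
            counts[kw] += len(pattern.findall(phrase))
    return counts


phrases = [
    "Hello Good evening, how are you, so can you point out the facts which lead to this eventful night",
    "It was a hi-fi system, dead on target.",  # invented sample phrase
]
keywords = ["Good evening", "eventful", "event", "hi-fi", "dead on target", "a"]

result = count_keywords(phrases, keywords)
for kw, n in result.items():
    print(kw, ':', n)
```

With the dataframes from the question, the call would presumably look like count_keywords(pos_phrases['Phrases'], data['Words']), assuming both columns hold plain strings with no missing values; the returned dict can be wrapped in pd.Series if a Series is needed.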

Upvotes: 1
