Reputation: 57
I currently have a list of words about MMA.
I want to create a new column in my pandas DataFrame called 'MMA Related Word Count'. For each row, I want to scan the 'Speech' column and count how often the words from the list below occur in the speech. Does anyone know the best way to do this? Thanks in advance!
Please take a look at my dataframe.
CODE EXAMPLE:
import pandas as pd
mma_related_words = ['mma', 'ju jitsu', 'boxing']
data = {
"Name": ['Dana White', 'Triple H'],
"Speech": ['mma is a fantastic sport. ju jitsu makes you better as a person.', 'Boxing sucks. Professional dancing is much better.']
}
#load data into a DataFrame object:
df = pd.DataFrame(data)
print(df)
CURRENT DATAFRAME:
Name | Speech |
---|---|
Dana White | mma is a fantastic sport. ju jitsu makes you better as a person. |
Triple H | Boxing sucks. Professional dancing is much better. |
--
EXPECTED OUTPUT: Exactly the same as above, but with a new column 'MMA Related Word Count' on the right. For Dana White the value should be 2; for Triple H it should be 1.
Upvotes: 1
Views: 46
Reputation: 1986
A simple loop inside a function passed to apply should work. Try this:
def fun(string):
    cnt = 0
    for w in mma_related_words:
        if w.lower() in string.lower():
            cnt = cnt + 1
    return cnt
df['MMA Related Word Count'] = df['Speech'].apply(fun)
The same can also be written as a one-liner:
df['MMA Related Word Count1'] = df['Speech'].apply(lambda x: sum([1 for w in mma_related_words if w.lower() in str(x).lower()]))
Output of df;
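One caveat with the substring check above: it adds 1 for each listed phrase that is present, so a phrase that appears three times in a speech still contributes only 1. If the goal is to sum how often the words occur, a minimal sketch could use Python's built-in str.count instead. The helper name count_occurrences is hypothetical and not part of the original answer.

```python
import pandas as pd

mma_related_words = ['mma', 'ju jitsu', 'boxing']

def count_occurrences(text, words=mma_related_words):
    # Hypothetical helper: sums the non-overlapping occurrences of every
    # phrase, case-insensitively, instead of adding 1 per phrase present.
    lowered = str(text).lower()
    return sum(lowered.count(w.lower()) for w in words)

df = pd.DataFrame({
    "Name": ['Dana White', 'Triple H'],
    "Speech": [
        'mma is a fantastic sport. ju jitsu makes you better as a person.',
        'Boxing sucks. Professional dancing is much better.',
    ],
})
df['MMA Related Word Count'] = df['Speech'].apply(count_occurrences)

# A speech that repeats a phrase is counted once per occurrence.
repeated = count_occurrences('MMA, mma and more mma')
```

On the sample data both approaches give the same counts, because no phrase is repeated within a speech; they only diverge on inputs like the last line.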
Upvotes: 0
Reputation: 262234
You can use a regex with str.count:
import re
regex = '|'.join(map(re.escape, mma_related_words))
# 'mma|ju\\ jitsu|boxing'
df['Word Count'] = df['Speech'].str.count(regex, flags=re.I)
# or
# df['Word Count'] = df['Speech'].str.count(r'(?i)'+regex)
output:
         Name                                             Speech  Word Count
0  Dana White  mma is a fantastic sport. ju jitsu makes you b...           2
1    Triple H  Boxing sucks. Professional dancing is much bet...           1
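The pattern above also matches inside longer words, for example 'mma' in 'summary' or 'boxing' in 'Kickboxing'. If only whole words should count, one possible variant wraps the alternation in \b word boundaries. This is a sketch under that assumption; the names bounded, speeches, plain and strict are illustrative only.

```python
import re
import pandas as pd

mma_related_words = ['mma', 'ju jitsu', 'boxing']

escaped = '|'.join(map(re.escape, mma_related_words))
# \b on both sides of the group requires each match to start and end
# at a word boundary, so substrings of longer words are skipped.
bounded = r'\b(?:' + escaped + r')\b'

speeches = pd.Series([
    'A summary of MMA and boxing.',
    'Kickboxing is not on the list.',
])

plain = speeches.str.count(escaped, flags=re.I)   # also counts partial-word hits
strict = speeches.str.count(bounded, flags=re.I)  # whole words/phrases only
```

Whether partial matches should count depends on the data; for free-form speech text the bounded version is probably the safer default.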
Upvotes: 2