Brian

Reputation: 57

How to count word occurrences (from a specific list of words) and store the results in a new column in a Pandas DataFrame in Python?

I currently have a list of words about MMA.

I want to create a new column in my Pandas DataFrame called 'MMA Related Word Count'. For each row, I want to analyze the 'Speech' column and sum up how often words from the list below occur within the speech. Does anyone know the best way to do this? Thanks in advance!

Please take a look at my dataframe.

CODE EXAMPLE:

import pandas as pd

mma_related_words = ['mma', 'ju jitsu', 'boxing']

data = {
  "Name": ['Dana White', 'Triple H'],
  "Speech": ['mma is a fantastic sport. ju jitsu makes you better as a person.', 'Boxing sucks. Professional dancing is much better.']
}

# Load data into a DataFrame object:
df = pd.DataFrame(data)

print(df)

CURRENT DATAFRAME:

Name        Speech
Dana White  mma is a fantastic sport. ju jitsu makes you better as a person.
Triple H    Boxing sucks. Professional dancing is much better.

--

EXPECTED OUTPUT: Exactly the same as above, but with a new column 'MMA Related Word Count' on the right. For Dana White the value should be 2; for Triple H it should be 1.

Upvotes: 1

Views: 46

Answers (2)

Sachin Kohli

Reputation: 1986

A simple loop inside a function passed to apply should work. Try this:

def fun(string):
    cnt = 0
    for w in mma_related_words:
        if w.lower() in string.lower():
            cnt = cnt + 1
    return cnt

df['MMA Related Word Count'] = df['Speech'].apply(fun)

The same can also be written as a one-liner:

df['MMA Related Word Count1'] = df['Speech'].apply(lambda x: sum([1 for w in mma_related_words if w.lower() in str(x).lower()]))

Output of df: both new columns contain 2 for Dana White and 1 for Triple H.
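Note that the loop above adds 1 for each listed word that is present, so a word repeated within one speech still counts once. If every repetition should count, a minimal sketch could sum str.count over the list instead. The helper name count_occurrences is hypothetical, and the data is copied from the question:

```python
import pandas as pd

# Word list and sample data copied from the question.
mma_related_words = ['mma', 'ju jitsu', 'boxing']
df = pd.DataFrame({
    "Name": ['Dana White', 'Triple H'],
    "Speech": [
        'mma is a fantastic sport. ju jitsu makes you better as a person.',
        'Boxing sucks. Professional dancing is much better.',
    ],
})

def count_occurrences(text, words=mma_related_words):
    # Hypothetical helper: sums the number of (case-insensitive) substring
    # matches of every listed word, so repeated words are counted each time.
    lowered = str(text).lower()
    return sum(lowered.count(w.lower()) for w in words)

df['MMA Related Word Count'] = df['Speech'].apply(count_occurrences)
print(df[['Name', 'MMA Related Word Count']])
```

For the sample data this gives the same values as the presence-based version, because no listed word appears twice in a speech. Like the loop above, it matches substrings, so 'mma' inside a longer word would also be counted.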

Upvotes: 0

mozway

Reputation: 262234

You can use a regex with str.count:

import re
regex = '|'.join(map(re.escape, mma_related_words))
# 'mma|ju\\ jitsu|boxing'

df['Word Count'] = df['Speech'].str.count(regex, flags=re.I)
# or
# df['Word Count'] = df['Speech'].str.count(r'(?i)'+regex)

output:

         Name                                             Speech  Word Count
0  Dana White  mma is a fantastic sport. ju jitsu makes you b...           2
1    Triple H  Boxing sucks. Professional dancing is much bet...           1
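One caveat: the plain alternation also matches inside longer words (for example 'mma' in 'dilemma'). If only whole words should count, a possible variant wraps the alternation in word boundaries. This is a sketch under that assumption; the demo Series is made-up illustration data, not part of the question:

```python
import re
import pandas as pd

# Word list and sample data copied from the question.
mma_related_words = ['mma', 'ju jitsu', 'boxing']
df = pd.DataFrame({
    "Name": ['Dana White', 'Triple H'],
    "Speech": [
        'mma is a fantastic sport. ju jitsu makes you better as a person.',
        'Boxing sucks. Professional dancing is much better.',
    ],
})

# Plain alternation, as in the answer above.
plain = '|'.join(map(re.escape, mma_related_words))
# Same alternation inside a non-capturing group, anchored by \b on both sides
# so that matches must start and end at a word boundary.
bounded = r'\b(?:' + plain + r')\b'

df['Word Count'] = df['Speech'].str.count(bounded, flags=re.I)

# Made-up sentence to show the difference between the two patterns.
demo = pd.Series(['A dilemma about mma.'])
print(demo.str.count(plain, flags=re.I).tolist())    # counts 'mma' inside 'dilemma' too
print(demo.str.count(bounded, flags=re.I).tolist())  # counts only the standalone 'mma'
```

On the question's data both patterns give 2 and 1; they only differ when a listed word appears as part of a longer word.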

Upvotes: 2
