crackers

Reputation: 357

Get percentage of selected words in a large corpus in dataframe

I have a list of keywords like the following:

keywords = {'dog', 'people', 'bird', 'snake', 'rabbit', 'forest'}

I would like to get the percentage of these keywords among all words in each row of a column in a pandas dataframe. Each row in the column contains a lot of text. With the following code, I get the count of the selected keywords in a column named perc_words. Is there any way to transform this into a percentage? Thanks much.

import pandas as pd
df['perc_words'] = df['text'].apply(lambda x: sum(i in keywords for i in str(x).split()))

Upvotes: 1

Views: 376

Answers (1)

tdy

Reputation: 41407

You can use .str.count() to count the occurrences of keywords, then divide by .str.len():

df['perc_words'] = df.text.str.count('|'.join(keywords)) / df.text.str.split().str.len()

To get occurrences per 1000, you can multiply perc_words by 1000:

df['per_1000'] = df.perc_words * 1000

Toy example:

df = pd.DataFrame({'text': ['dog apple', 'foo', 'people are people']})

#                 text
# 0          dog apple
# 1                foo
# 2  people are people

Count of keywords:

df.text.str.count('|'.join(keywords))

# 0    1
# 1    0
# 2    2
# Name: text, dtype: int64

Count of total words:

df.text.str.split().str.len()

# 0    2
# 1    1
# 2    3
# Name: text, dtype: int64

Percentage of keywords:

df['perc_words'] = df.text.str.count('|'.join(keywords)) / df.text.str.split().str.len()
df['per_1000'] = df.perc_words * 1000

#                 text  perc_words    per_1000
# 0          dog apple    0.500000  500.000000
# 1                foo    0.000000    0.000000
# 2  people are people    0.666667  666.666667
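One caveat worth noting: `.str.count('|'.join(keywords))` matches substrings, so `'dog'` would also be counted inside `'dogma'`. If that matters for your corpus, a sketch that anchors each keyword at word boundaries with `\b` and escapes any regex metacharacters via `re.escape` (the `'dogma foo'` row below is my own test case, not from the question):

```python
import re

import pandas as pd

keywords = {'dog', 'people', 'bird', 'snake', 'rabbit', 'forest'}
df = pd.DataFrame({'text': ['dog apple', 'dogma foo', 'people are people']})

# Escape each keyword and require word boundaries so 'dog' does not match 'dogma'
pattern = r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b'

df['perc_words'] = df.text.str.count(pattern) / df.text.str.split().str.len()
print(df)

#                 text  perc_words
# 0          dog apple    0.500000
# 1          dogma foo    0.000000
# 2  people are people    0.666667
```

Without the boundary anchors, the `'dogma foo'` row would incorrectly score 0.5 instead of 0.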

Upvotes: 1
