Reputation: 357
I have a set of keywords like the following:
keywords = {'dog', 'people', 'bird', 'snake', 'rabbit', 'forest'}
I would like to get the percentage of words in each row of a pandas DataFrame column that are one of these keywords. Each row in the column contains a lot of text. With the following code, I get the count of the selected keywords, stored in a column named perc_words. Is there any way to transform this into a percentage? Thanks very much.
import pandas as pd
df['perc_words'] = df['text'].apply(lambda x: sum(i in keywords for i in str(x).split()))
Upvotes: 1
Views: 376
Reputation: 41407
You can use .str.count() to count the occurrences of the keywords, then divide by the number of words in each row, which .str.split().str.len() gives you:
df['perc_words'] = df.text.str.count('|'.join(keywords)) / df.text.str.split().str.len()
To get occurrences per 1000 words, you can multiply perc_words by 1000:
df['per_1000'] = df.perc_words * 1000
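One caveat: the joined pattern is treated as a regular expression and matched anywhere in the text, so dog is also counted inside hotdog or dogs. If only whole words should count (as in the question's split-based code), a minimal sketch could look like the following. It assumes word-boundary matching is what you want, and the hotdog row is made up purely for illustration:

```python
import re
import pandas as pd

keywords = {'dog', 'people', 'bird', 'snake', 'rabbit', 'forest'}

# Escape each keyword and wrap the alternation in \b word boundaries,
# so 'dog' is not counted inside 'hotdog' or 'dogs'.
pattern = r'\b(?:' + '|'.join(map(re.escape, sorted(keywords))) + r')\b'

# Illustrative data only; the 'hotdog stand' row is a made-up example.
df = pd.DataFrame({'text': ['dog apple', 'hotdog stand', 'people are people']})

keyword_counts = df['text'].str.count(pattern)   # whole-word keyword matches
word_counts = df['text'].str.split().str.len()   # whitespace-separated words
df['perc_words'] = keyword_counts / word_counts
```

With the plain '|'.join(keywords) pattern, the hotdog row would count one match; with the word-boundary pattern it counts none.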
Toy example:
df = pd.DataFrame({'text': ['dog apple', 'foo', 'people are people']})
#                 text
# 0          dog apple
# 1                foo
# 2  people are people
Count of keywords:
df.text.str.count('|'.join(keywords))
# 0 1
# 1 0
# 2 2
# Name: text, dtype: int64
Count of total words:
df.text.str.split().str.len()
# 0 2
# 1 1
# 2 3
# Name: text, dtype: int64
Percentage of keywords:
df['perc_words'] = df.text.str.count(r'|'.join(keywords)) / df.text.str.split().str.len()
df['per_1000'] = df.perc_words * 1000
#                 text  perc_words    per_1000
# 0          dog apple    0.500000  500.000000
# 1                foo    0.000000    0.000000
# 2  people are people    0.666667  666.666667
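Note that perc_words above is a fraction between 0 and 1. If you want a value on a 0 to 100 scale, and some rows might be empty strings, something along these lines may help. This is only a sketch: the pct_words column name is my own choice, and treating an empty row as 0 percent is an assumption you may want to change:

```python
import pandas as pd

keywords = {'dog', 'people', 'bird', 'snake', 'rabbit', 'forest'}

# Illustrative data only; the empty string stands in for a row with no text.
df = pd.DataFrame({'text': ['dog apple', '', 'people are people']})

keyword_counts = df['text'].str.count('|'.join(keywords))
word_counts = df['text'].str.split().str.len()

# An empty row has 0 words, which would give 0 / 0. Clipping the divisor
# to at least 1 keeps the numerator at 0, so the row comes out as 0 percent
# (an assumption about how empty rows should be treated).
df['pct_words'] = keyword_counts / word_counts.clip(lower=1) * 100
```

Missing values (NaN) in the text column are not handled here and would stay NaN.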
Upvotes: 1