Reputation: 141
I have a dataframe that looks like the following
Utterance                          Frequency
Directions to Starbucks                 1045
Show me directions to Starbucks          754
Give me directions to Starbucks          612
Navigate me to Starbucks                 498
Display navigation to Starbucks          376
Direct me to Starbucks                   201
Navigate to Starbucks                    180
Here there is some data showing utterances made by people, and how frequently each was said.
I.e., "Directions to Starbucks" was uttered 1045 times, "Show me directions to Starbucks" was uttered 754 times, etc.
I'm trying to get a frequency of how many times individual words were uttered. I tried using .value_counts(), but that only gave me the following:
Utterance    Frequency
Starbucks            7
Directions           3
Navigate             2
.
.
.
Instead, I am trying to get the following output
Utterance    Frequency
Starbucks         3666
Directions        2411
Navigate           678
.
.
.
In other words, I'm trying to get the frequency of how many times each word was uttered, not the number of rows it appears in, which is what happened with .value_counts().
Thanks for any help!
Upvotes: 3
Views: 149
Reputation: 9019
This should do the trick:
output = {}
for i in ['starbucks', 'directions', 'navigate']:
    output[i] = df[df['Utterance'].str.lower().str.contains(i)]['Frequency'].sum()
Yields:
{'starbucks': 3666, 'directions': 2411, 'navigate': 678}
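One caveat: str.contains does substring matching, so a query like 'direct' would also match "directions". A variant of the same idea (a sketch, assuming whole-word matching is what's wanted) uses a word-boundary regex:

```python
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

output = {}
for word in ['starbucks', 'directions', 'navigate']:
    # \b anchors ensure e.g. 'direct' would not match inside 'directions'
    mask = df['Utterance'].str.lower().str.contains(rf'\b{word}\b', regex=True)
    output[word] = int(df.loc[mask, 'Frequency'].sum())

print(output)  # {'starbucks': 3666, 'directions': 2411, 'navigate': 678}
```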
Upvotes: 0
Reputation: 164793
For an O(n) complexity solution, use collections.Counter.
from collections import Counter
import pandas as pd
df = pd.DataFrame([['Directions to Starbucks', 1045],
['Show me directions to Starbucks', 754],
['Give me directions to Starbucks', 612],
['Navigate me to Starbucks', 498],
['Display navigation to Starbucks', 376],
['Direct me to Starbucks', 201],
['Navigate to Starbucks', 180]],
columns = ['Utterance', 'Frequency'])
c = Counter()
for row in df.itertuples():
    for i in row[1].split():
        c[i] += row[2]
res = pd.DataFrame.from_dict(c, orient='index')\
        .rename(columns={0: 'Count'})\
        .sort_values('Count', ascending=False)
Result
Count
to 3666
Starbucks 3666
me 2065
directions 1366
Directions 1045
Show 754
Navigate 678
Give 612
Display 376
navigation 376
Direct 201
Explanation: iterate over the rows once, split each utterance into words, and add that row's Frequency to each word's running total in the Counter. Each row is visited exactly once, hence O(n).
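Note that the Counter keeps 'Directions' and 'directions' as separate keys, as the result above shows. A small variant (a sketch, assuming case-insensitive counting is acceptable) lowercases each utterance before splitting:

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

c = Counter()
for row in df.itertuples():
    # row[1] is Utterance, row[2] is Frequency; lowercasing merges case variants
    for word in row[1].lower().split():
        c[word] += int(row[2])
```

Here 'Directions' (1045) and 'directions' (1366) fold into a single key with count 2411.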
Upvotes: 2
Reputation: 863451
I think you need:
df = (df.set_index('Frequency')['Utterance']
.str.split(expand=True)
.stack()
.groupby(level=0)
.value_counts()
.reset_index(name='new')
.assign(Frequency = lambda x: x.Frequency * x['new'])
.groupby('level_1', as_index=False)['Frequency'].sum()
.rename(columns={'level_1':'Words'})
)
print(df)
Words Frequency
0 Direct 201
1 Directions 1045
2 Display 376
3 Give 612
4 Navigate 678
5 Show 754
6 Starbucks 3666
7 directions 1366
8 me 2065
9 navigation 376
10 to 3666
If each row contains only unique words, the solution simplifies to:
df = (df.set_index('Frequency')['Utterance']
.str.split(expand=True)
.stack()
.reset_index(name='Words')
.groupby('Words', as_index=False)['Frequency'].sum()
)
print(df)
Words Frequency
0 Direct 201
1 Directions 1045
2 Display 376
3 Give 612
4 Navigate 678
5 Show 754
6 Starbucks 3666
7 directions 1366
8 me 2065
9 navigation 376
10 to 3666
Explanation:
- set_index by Frequency to keep it alongside each word
- split the sentences to words to a DataFrame with str.split(expand=True)
- reshape to one word per row with stack
- count words per row with SeriesGroupBy.value_counts
- multiply those counts by Frequency with assign
- aggregate the sum by words with GroupBy.sum
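On pandas 0.25 or newer, the same flatten-then-sum idea can be sketched with Series.explode in place of split/stack (assuming plain whitespace tokenization, as above):

```python
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

res = (df.assign(Words=df['Utterance'].str.split())   # list of words per row
         .explode('Words')                            # one word per row
         .groupby('Words', as_index=False)['Frequency'].sum())
```

Each exploded row carries its original Frequency, so the groupby-sum yields the same per-word totals as the split/stack version.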
Upvotes: 2