user_seaweed

Reputation: 141

frequency of words in a list based on data in other columns

I have a dataframe that looks like the following

Utterance                         Frequency   
Directions to Starbucks           1045
Show me directions to Starbucks   754
Give me directions to Starbucks   612
Navigate me to Starbucks          498
Display navigation to Starbucks   376
Direct me to Starbucks            201
Navigate to Starbucks             180

Here, there is some data that show utterances made by people, and how frequently these were said.

I.e., "Directions to Starbucks" was uttered 1045 times, "Show me directions to Starbucks" was uttered 754 times, etc.

I'm trying to get a frequency of how many times individual words were uttered.

I tried using .value_counts(), but that only gave me the following

Utterance       Frequency   
Starbucks       7
Directions      3
Navigate        2
.
.
.

Instead, I am trying to get the following output

Utterance       Frequency   
Starbucks       3666
Directions      2411
Navigate        678
.
.
.

In other words, I'm trying to get the frequency of how many times each word was uttered, not the number of rows it appears in, which is what happened with .value_counts()

Thanks for any help!

Upvotes: 3

Views: 149

Answers (3)

rahlf23

Reputation: 9019

This should do the trick:

# Sum the Frequency column over the rows whose utterance contains each word
output = {}
for word in ['starbucks', 'directions', 'navigate']:
    output[word] = df[df['Utterance'].str.lower().str.contains(word)]['Frequency'].sum()

Yields:

{'starbucks': 3666, 'directions': 2411, 'navigate': 678}
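Note that this requires listing the target words up front, and that `str.contains` does substring matching, so a word like "direct" would also match rows containing "directions". If that matters, a small variation using regex word boundaries (a sketch, not part of the original answer) restricts matches to whole words:

```python
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

# \b anchors stop 'direct' from also matching 'directions'
output = {}
for word in ['starbucks', 'directions', 'navigate', 'direct']:
    mask = df['Utterance'].str.contains(rf'\b{word}\b', case=False, regex=True)
    output[word] = df.loc[mask, 'Frequency'].sum()
```

With plain substring matching, 'direct' would pick up the 2411 "directions" utterances as well; with the word boundaries it only counts "Direct me to Starbucks" (201).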

Upvotes: 0

jpp

Reputation: 164793

For an O(n) complexity solution, use collections.Counter.

from collections import Counter
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns = ['Utterance', 'Frequency'])

c = Counter()

# row[1] is the Utterance string, row[2] is its Frequency
for row in df.itertuples():
    for word in row[1].split():
        c[word] += row[2]

res = pd.DataFrame.from_dict(c, orient='index')\
        .rename(columns={0: 'Count'})\
        .sort_values('Count', ascending=False)

Result

            Count
to           3666
Starbucks    3666
me           2065
directions   1366
Directions   1045
Show          754
Navigate      678
Give          612
Display       376
navigation    376
Direct        201

Explanation

  • Loop through each row and each word in each row.
  • For each word, increment the counter, using the word as key.
  • Create dataframe from resulting dictionary.
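The result above keeps "Directions" and "directions" as separate keys. If case variants should be merged, lower-casing each utterance before splitting (a small variation on the loop above, not part of the original answer) does it:

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

c = Counter()
for row in df.itertuples():
    # Lower-case the utterance so 'Directions' and 'directions' share one key
    for word in row[1].lower().split():
        c[word] += row[2]
```

This folds the 1045 "Directions" and 1366 "directions" counts into a single `directions` entry of 2411.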

Upvotes: 2

jezrael

Reputation: 863451

I think you need:

df = (df.set_index('Frequency')['Utterance']
        .str.split(expand=True)
        .stack()
        .groupby(level=0)
        .value_counts()
        .reset_index(name='new')
        .assign(Frequency = lambda x: x.Frequency * x['new'])
        .groupby('level_1', as_index=False)['Frequency'].sum()
        .rename(columns={'level_1':'Words'})
        )
print (df)
         Words  Frequency
0       Direct        201
1   Directions       1045
2      Display        376
3         Give        612
4     Navigate        678
5         Show        754
6    Starbucks       3666
7   directions       1366
8           me       2065
9   navigation        376
10          to       3666

If each row contains only unique words, the solution can be simplified:

df = (df.set_index('Frequency')['Utterance']
        .str.split(expand=True)
        .stack()
        .reset_index(name='Words')
        .groupby('Words', as_index=False)['Frequency'].sum()
        )
print (df)
         Words  Frequency
0       Direct        201
1   Directions       1045
2      Display        376
3         Give        612
4     Navigate        678
5         Show        754
6    Starbucks       3666
7   directions       1366
8           me       2065
9   navigation        376
10          to       3666

Explanation:

  1. Create an index from column Frequency
  2. Split sentences into words, one column per word, with str.split(expand=True)
  3. Reshape by stack
  4. Get counts per group by SeriesGroupBy.value_counts
  5. Multiply the count column by Frequency with assign
  6. Aggregate the sum per word with GroupBy.sum
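On pandas 0.25 or newer, the split/stack reshaping steps above can also be expressed with DataFrame.explode; a sketch (not part of the original answer) that gives the same per-word sums:

```python
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

res = (df.assign(Words=df['Utterance'].str.split())  # list of words per row
         .explode('Words')                           # one row per (word, Frequency) pair
         .groupby('Words', as_index=False)['Frequency'].sum())
```

Because explode emits one row per word occurrence, repeated words within an utterance are counted each time they appear, matching the behaviour of the full solution above.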

Upvotes: 2
