Reputation: 141
I have a dataframe that looks like the following
Utterance                          Frequency
Directions to Starbucks                 1045
Show me directions to Starbucks          754
Give me directions to Starbucks          612
Navigate me to Starbucks                 498
Display navigation to Starbucks          376
Direct me to Starbucks                   201
Navigate to Starbucks                    180
Here there is some data showing utterances made by people, and how frequently each was said.
I.e., "Directions to Starbucks" was uttered 1045 times, "Show me directions to Starbucks" was uttered 754 times, etc.
I'm trying to get a frequency of how many times individual words were uttered. I tried using .value_counts(), but that only gave me the following:
Utterance    Frequency
Starbucks            7
Directions           3
Navigate             2
.
.
.
Instead, I am trying to get the following output
Utterance    Frequency
Starbucks         3666
Directions        2411
Navigate           678
.
.
.
In other words, I'm trying to get the frequency of how many times each word was uttered, not the number of rows it appears in, which is what happened with .value_counts().
Thanks for any help!
Upvotes: 3
Views: 149
Reputation: 9019
This should do the trick:
output = {}
for i in ['starbucks', 'directions', 'navigate']:
    output[i] = df[df['Utterance'].str.lower().str.contains(i)]['Frequency'].sum()
Yields:
{'starbucks': 3666, 'directions': 2411, 'navigate': 678}
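One caveat: str.contains does substring matching, so a query like 'direct' would also match "directions". A variant of the same idea (a sketch, assuming whole-word matching is what's wanted) uses a word-boundary regex:

```python
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

output = {}
for word in ['starbucks', 'directions', 'navigate']:
    # \b anchors ensure e.g. 'direct' would not match inside 'directions'
    mask = df['Utterance'].str.lower().str.contains(rf'\b{word}\b', regex=True)
    output[word] = int(df.loc[mask, 'Frequency'].sum())

print(output)  # {'starbucks': 3666, 'directions': 2411, 'navigate': 678}
```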
Upvotes: 0
Reputation: 164793
For an O(n) complexity solution, use collections.Counter.
from collections import Counter
import pandas as pd
df = pd.DataFrame([['Directions to Starbucks', 1045],
['Show me directions to Starbucks', 754],
['Give me directions to Starbucks', 612],
['Navigate me to Starbucks', 498],
['Display navigation to Starbucks', 376],
['Direct me to Starbucks', 201],
['Navigate to Starbucks', 180]],
columns = ['Utterance', 'Frequency'])
c = Counter()
for row in df.itertuples():
    for i in row[1].split():
        c[i] += row[2]
res = pd.DataFrame.from_dict(c, orient='index')\
        .rename(columns={0: 'Count'})\
        .sort_values('Count', ascending=False)
Result
Count
to 3666
Starbucks 3666
me 2065
directions 1366
Directions 1045
Show 754
Navigate 678
Give 612
Display 376
navigation 376
Direct 201
Explanation: iterate over the rows once, split each utterance into words, and add that row's Frequency to each word's running total in the Counter. Each row is visited exactly once, hence O(n).
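Note that the Counter keeps 'Directions' and 'directions' as separate keys, as the result above shows. A small variant (a sketch, assuming case-insensitive counting is acceptable) lowercases each utterance before splitting:

```python
from collections import Counter
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

c = Counter()
for row in df.itertuples():
    # row[1] is Utterance, row[2] is Frequency; lowercasing merges case variants
    for word in row[1].lower().split():
        c[word] += int(row[2])
```

Here 'Directions' (1045) and 'directions' (1366) fold into a single key with count 2411.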
Upvotes: 2
Reputation: 863451
I think you need:
df = (df.set_index('Frequency')['Utterance']
.str.split(expand=True)
.stack()
.groupby(level=0)
.value_counts()
.reset_index(name='new')
.assign(Frequency = lambda x: x.Frequency * x['new'])
.groupby('level_1', as_index=False)['Frequency'].sum()
.rename(columns={'level_1':'Words'})
)
print(df)
Words Frequency
0 Direct 201
1 Directions 1045
2 Display 376
3 Give 612
4 Navigate 678
5 Show 754
6 Starbucks 3666
7 directions 1366
8 me 2065
9 navigation 376
10 to 3666
If each row contains only unique words, the solution simplifies to:
df = (df.set_index('Frequency')['Utterance']
.str.split(expand=True)
.stack()
.reset_index(name='Words')
.groupby('Words', as_index=False)['Frequency'].sum()
)
print(df)
Words Frequency
0 Direct 201
1 Directions 1045
2 Display 376
3 Give 612
4 Navigate 678
5 Show 754
6 Starbucks 3666
7 directions 1366
8 me 2065
9 navigation 376
10 to 3666
Explanation:
- set_index by Frequency to keep it alongside each word
- split the sentences to words to a DataFrame with str.split(expand=True)
- reshape to one word per row with stack
- count words per row with SeriesGroupBy.value_counts
- multiply those counts by Frequency with assign
- aggregate the sum by words with GroupBy.sum
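On pandas 0.25 or newer, the same flatten-then-sum idea can be sketched with Series.explode in place of split/stack (assuming plain whitespace tokenization, as above):

```python
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

res = (df.assign(Words=df['Utterance'].str.split())   # list of words per row
         .explode('Words')                            # one word per row
         .groupby('Words', as_index=False)['Frequency'].sum())
```

Each exploded row carries its original Frequency, so the groupby-sum yields the same per-word totals as the split/stack version.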
Upvotes: 2