Reputation: 1
I have a data frame that looks something like this:
Category | Topic |
---|---|
Category1 | Topic1 |
Category2 | Topic2 |
Category1 | Topic2 |
Category3 | Topic3 |
Category2 | Topic3 |
Category3 | Topic3 |
And I want an output like this:
Category | Topic | Frequency |
---|---|---|
Category1 | Topic1 | |
Topic2 | ||
Topic3 | ||
Catgeory2 | Topic1 | |
Topic2 | ||
Topic3 | ||
Category3 | Topic1 | |
Topic2 | ||
Topic3 |
I am just starting out with python and I'd really appreciate it if someone could help me out with this.
Upvotes: 0
Views: 124
Reputation: 26221
If the frequency
is meant to capture the frequency of topic within each category, then, a basic approch involves:
df.groupby('Category')['Topic'].value_counts(normalize=True)
Which is a Series
. For example, on your input data, we get:
Category Topic
Category1 Topic1 0.5
Topic2 0.5
Category2 Topic2 0.5
Topic3 0.5
Category3 Topic3 1.0
Name: Topic, dtype: float64
For an output organized as per your example, that appears to be a DataFrame
with three columns:
out = (
df
.groupby('Category')['Topic']
.value_counts(normalize=True)
.to_frame('frequency')
.reset_index()
)
Again, on your input sample:
>>> out
Category Topic frequency
0 Category1 Topic1 0.5
1 Category1 Topic2 0.5
2 Category2 Topic2 0.5
3 Category2 Topic3 0.5
4 Category3 Topic3 1.0
Upvotes: 1