Qaswed

Reputation: 3889

How to get relative frequencies from pandas groupby, with two grouping variables?

Suppose my data look as follows:

import datetime
import pandas as pd
df = pd.DataFrame({'datetime': [datetime.datetime(2024, 11, 27, 0), datetime.datetime(2024, 11, 27, 1), datetime.datetime(2024, 11, 28, 0),
                               datetime.datetime(2024, 11, 28, 1), datetime.datetime(2024, 11, 28, 2)],
                  'product': ['Apple', 'Banana', 'Banana', 'Apple', 'Banana']})



    datetime            product
0   2024-11-27 00:00:00 Apple
1   2024-11-27 01:00:00 Banana
2   2024-11-28 00:00:00 Banana
3   2024-11-28 01:00:00 Apple
4   2024-11-28 02:00:00 Banana


All I want is to plot the relative frequencies of the products sold on each day. In this example, that is 1/2 (50%) apples and 1/2 bananas on 2024-11-27, and 1/3 apples and 2/3 bananas on 2024-11-28.


What I managed to do:

absolute_frequencies = df.groupby([pd.Grouper(key='datetime', freq='D'), 'product']).size().reset_index(name='count')
total_counts = absolute_frequencies.groupby('datetime')['count'].transform('sum')
absolute_frequencies['relative_frequency'] = absolute_frequencies['count'] / total_counts
absolute_frequencies.pivot(index='datetime', columns='product', values='relative_frequency').plot()

I am pretty confident there is a much less complicated way, since for the absolute frequencies I can simply use:

df.groupby([pd.Grouper(key='datetime', freq='D'), 'product']).size().unstack('product').plot(kind='line')

Upvotes: 2

Views: 64

Answers (2)

samhita

Reputation: 3490

1. Group by day and product.

2. Count the number of occurrences of each product per day.

3. Normalize the counts per day, i.e. convert them to relative frequencies by dividing each count by that day's total.

4. Unstack the product level so that each product becomes its own column.

import datetime
import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.tight_layout() and plt.show() below


df = pd.DataFrame({'datetime': [datetime.datetime(2024, 11, 27, 0), datetime.datetime(2024, 11, 27, 1), datetime.datetime(2024, 11, 28, 0),
                               datetime.datetime(2024, 11, 28, 1), datetime.datetime(2024, 11, 28, 2)],
                  'product': ['Apple', 'Banana', 'Banana', 'Apple', 'Banana']})

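# Count rows per (day, product), then divide each count by its day's total.
# transform keeps the original (datetime, product) index, whereas apply can
# prepend the group key as an extra index level in recent pandas versions.
relative_frequencies = df.groupby([pd.Grouper(key='datetime', freq='D'), 'product']) \
                         .size() \
                         .groupby(level=0) \
                         .transform(lambda x: x / x.sum()) \
                         .unstack('product')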
print(relative_frequencies)
ax = relative_frequencies.plot.bar(rot=45, figsize=(10, 6))

date_labels = [x.strftime('%b %d') for x in relative_frequencies.index.get_level_values(0)]

ax.set_xticklabels(date_labels, rotation=45)

# Optional: add gridlines to improve readability
ax.grid(True)

ax.set_title('Relative Frequencies of Products Sold per Day')
ax.set_xlabel('Date')
ax.set_ylabel('Relative Frequency')


plt.tight_layout()  
plt.show()

Output

product      Apple     Banana
datetime                         
2024-11-27   0.500000  0.500000
2024-11-28   0.333333  0.666667

Edited plot as per comment (image not included).
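
As a possible shortcut for steps 2 and 3 above, `value_counts(normalize=True)` on the grouped `product` column counts and normalizes within each day in a single call. This is a sketch of an alternative, not the code from this answer; the name `relative_frequencies_vc` is made up for illustration, and `fill_value=0` is an assumption about how a product with no sales on a given day should be shown.

```python
import datetime
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'datetime': [datetime.datetime(2024, 11, 27, 0), datetime.datetime(2024, 11, 27, 1),
                 datetime.datetime(2024, 11, 28, 0), datetime.datetime(2024, 11, 28, 1),
                 datetime.datetime(2024, 11, 28, 2)],
    'product': ['Apple', 'Banana', 'Banana', 'Apple', 'Banana'],
})

# Group by calendar day, then let value_counts compute each product's share
# within its day. unstack turns the product level into columns.
relative_frequencies_vc = (
    df.groupby(pd.Grouper(key='datetime', freq='D'))['product']
      .value_counts(normalize=True)
      .unstack('product', fill_value=0)  # assumed: missing product means a share of 0
)
print(relative_frequencies_vc)
```

The result has one row per day and one column per product, so it can be passed to `.plot()` or `.plot.bar()` in the same way as `relative_frequencies`.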

Upvotes: 0

mozway

Reputation: 262194

You can use a crosstab with normalize:

ct = pd.crosstab(df['datetime'].dt.normalize(), df['product'], normalize='index')

Output:

product        Apple    Banana
datetime                      
2024-11-27  0.500000  0.500000
2024-11-28  0.333333  0.666667
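
For anyone who wants to run this end to end, here is a minimal self-contained sketch that rebuilds the question's `df` and applies the same `pd.crosstab` call. The `row_sums` check is an addition for illustration and not part of the original answer; with `normalize='index'`, each day's shares should add up to 1.

```python
import datetime
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'datetime': [datetime.datetime(2024, 11, 27, 0), datetime.datetime(2024, 11, 27, 1),
                 datetime.datetime(2024, 11, 28, 0), datetime.datetime(2024, 11, 28, 1),
                 datetime.datetime(2024, 11, 28, 2)],
    'product': ['Apple', 'Banana', 'Banana', 'Apple', 'Banana'],
})

# .dt.normalize() sets every timestamp to midnight, so rows group by calendar day.
# normalize='index' divides each row of the table by its row total.
ct = pd.crosstab(df['datetime'].dt.normalize(), df['product'], normalize='index')

# Illustrative sanity check (hypothetical name): shares per day should sum to 1.
row_sums = ct.sum(axis=1)
print(ct)
print(row_sums)
```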

As a graph:

ct.plot.bar()

Output: a bar chart of the relative frequencies per day (image not included).

Upvotes: 1
