lengthy_preamble
lengthy_preamble

Reputation: 434

Figure out average for multiple subsets of rows

I have a csv file with data on store sales for each province, including the store ID. I've already figured out how to get a list of the provinces with the most sales, and a list of the stores with the most sales, but now I need to calculate: 1) The average store sales for each province and 2) The best-selling store in each province and then 3) The difference between them. The data looks like this:

>>> store_sales
                                sales
store_num province                   
1396      ONTARIO           223705.21
1891      ONTARIO            71506.85
4823      MANITOBA          114692.70
4861      MANITOBA             257.69
6905      ONTARIO            19713.24
6973      ONTARIO           336392.25
7104      BRITISH COLUMBIA   32233.31
7125      BRITISH COLUMBIA   11873.71
7167      BRITISH COLUMBIA   87488.70
7175      BRITISH COLUMBIA   14096.53
7194      BRITISH COLUMBIA    6327.60
7238      ALBERTA             1958.75
7247      ALBERTA             6231.31
7269      ALBERTA              451.56
7296      ALBERTA           184410.04
7317      SASKATCHEWAN       43491.55
8142      ONTARIO           429871.74
8161      ONTARIO             6479.71
9604      ONTARIO            20823.49
9609      ONTARIO              148.02
9802      ALBERTA            54101.00
9807      ALBERTA           543703.84

I was able to get there by using the following:

import pandas as pd
df = pd.read_csv('/path/to/sales.csv')
store_sales = df.groupby(['store_num', 'province']).agg({'sales': 'sum'})

I think 3) is probably pretty simple but for 1) I'm not sure how to apply an average to subsets of specific rows (I imagine it involves using 'groupby') and for 2) although I was able to generate a list of the best-selling stores, I'm uncertain as to how I could display a single top store for each province (although something tells me it should be simpler than it seems).

Upvotes: 0

Views: 31

Answers (1)

Yilun Zhang
Yilun Zhang

Reputation: 9008

For (1), you just need to pass the column name to groupby:

df.groupby("province).mean()

For (2), you just need to apply a different function to groupby:

df.groupby("province).max()

For (3), the difference can be easily calculated by subtracting (1) and (2):

df.groupby("province").max() - df.groupby("province").mean()

Upvotes: 2

Related Questions