Reputation: 434
I have a csv file with data on store sales for each province, including the store ID. I've already figured out how to get a list of the provinces with the most sales, and a list of the stores with the most sales, but now I need to calculate: 1) The average store sales for each province and 2) The best-selling store in each province and then 3) The difference between them. The data looks like this:
>>> store_sales
sales
store_num province
1396 ONTARIO 223705.21
1891 ONTARIO 71506.85
4823 MANITOBA 114692.70
4861 MANITOBA 257.69
6905 ONTARIO 19713.24
6973 ONTARIO 336392.25
7104 BRITISH COLUMBIA 32233.31
7125 BRITISH COLUMBIA 11873.71
7167 BRITISH COLUMBIA 87488.70
7175 BRITISH COLUMBIA 14096.53
7194 BRITISH COLUMBIA 6327.60
7238 ALBERTA 1958.75
7247 ALBERTA 6231.31
7269 ALBERTA 451.56
7296 ALBERTA 184410.04
7317 SASKATCHEWAN 43491.55
8142 ONTARIO 429871.74
8161 ONTARIO 6479.71
9604 ONTARIO 20823.49
9609 ONTARIO 148.02
9802 ALBERTA 54101.00
9807 ALBERTA 543703.84
I was able to get there by using the following:
import pandas as pd
df = pd.read_csv('/path/to/sales.csv')
store_sales = df.groupby(['store_num', 'province']).agg({'sales': 'sum'})
I think 3) is probably pretty simple but for 1) I'm not sure how to apply an average to subsets of specific rows (I imagine it involves using 'groupby') and for 2) although I was able to generate a list of the best-selling stores, I'm uncertain as to how I could display a single top store for each province (although something tells me it should be simpler than it seems).
Upvotes: 0
Views: 31
Reputation: 9008
For (1), you just need to pass the column name to groupby
:
df.groupby("province).mean()
For (2), you just need to apply a different function to groupby
:
df.groupby("province).max()
For (3), the difference can be easily calculated by subtracting (1) and (2):
df.groupby("province").max() - df.groupby("province").mean()
Upvotes: 2