I have a data frame with Country, Region, City, Product, Sales ($) and Units. For each Country, Region, City combination I need the top 3 products by sales, with the remaining products rolled up into an "Other" row carrying their summed sales and units.
The end result is the top 3 products + "Other" for each combination of Country, Region, City:
Country Region City Product Sales Val
Europe Italy Milan Ring 100 10
Europe Italy Milan Book 300 5
Europe Italy Milan Phone 1500 10
Europe Italy Milan Car 200 5
Europe Italy Milan Ring 100 10
Europe Italy Milan Pen 8200 5
Results for top 3:
Country Region City Product Sales Val
Europe Italy Milan Pen 8200 5
Europe Italy Milan Phone 1500 10
Europe Italy Milan Book 300 5
Europe Italy Milan Other 400 25
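For reference, the sample frame above can be reconstructed like this (column names taken from the question):

```python
import pandas as pd

# Milan sample rows from the question
df = pd.DataFrame({
    'Country': ['Europe'] * 6,
    'Region': ['Italy'] * 6,
    'City': ['Milan'] * 6,
    'Product': ['Ring', 'Book', 'Phone', 'Car', 'Ring', 'Pen'],
    'Sales': [100, 300, 1500, 200, 100, 8200],
    'Val': [10, 5, 10, 5, 10, 5],
})
```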
First, create a default index with reset_index:
df = df.reset_index(drop=True)
Then sort by the Sales column with sort_values and take the top 3 rows per group with GroupBy.head:
cols = ['Country','Region', 'City']
df1 = df.sort_values('Sales', ascending=False).groupby(cols).head(3)
print (df1)
Country Region City Product Sales Val
5 Europe Italy Milan Pen 8200 5
2 Europe Italy Milan Phone 1500 10
1 Europe Italy Milan Book 300 5
Then filter out the rows already used for the top 3 (via Index.difference) and aggregate the rest with sum:
df2 = df.loc[df.index.difference(df1.index)]
df2 = df2.groupby(cols, as_index=False).sum().assign(Product='Other')
print (df2)
Country Region City Sales Val Product
0 Europe Italy Milan 400 25 Other
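In case Index.difference is unfamiliar, a minimal sketch of what it does (toy indexes, not the question's data):

```python
import pandas as pd

a = pd.Index([0, 1, 2, 3, 4, 5])  # all row labels
b = pd.Index([1, 2, 5])           # labels kept as "top 3"

# Index.difference returns the sorted labels of `a` not present in `b`
remaining = a.difference(b)
print(list(remaining))  # → [0, 3, 4]
```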
Last, join both parts together with concat:
df = pd.concat([df1, df2]).sort_values(cols).reset_index(drop=True)
print (df)
City Country Product Region Sales Val
0 Milan Europe Pen Italy 8200 5
1 Milan Europe Phone Italy 1500 10
2 Milan Europe Book Italy 300 5
3 Milan Europe Other Italy 400 25
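Putting the first solution together as a runnable sketch (using the Milan sample from the question; the final column reindex is an addition to keep the original column order, since older pandas versions of concat may alphabetize it as in the output above):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Europe'] * 6,
    'Region': ['Italy'] * 6,
    'City': ['Milan'] * 6,
    'Product': ['Ring', 'Book', 'Phone', 'Car', 'Ring', 'Pen'],
    'Sales': [100, 300, 1500, 200, 100, 8200],
    'Val': [10, 5, 10, 5, 10, 5],
})

cols = ['Country', 'Region', 'City']

# top 3 rows per group by descending Sales
df1 = df.sort_values('Sales', ascending=False).groupby(cols).head(3)

# remaining rows, summed into one 'Other' row per group
# (explicit numeric column selection avoids summing the Product strings)
df2 = df.loc[df.index.difference(df1.index)]
df2 = (df2.groupby(cols, as_index=False)[['Sales', 'Val']].sum()
          .assign(Product='Other'))

out = pd.concat([df1, df2]).sort_values(cols).reset_index(drop=True)
out = out[df.columns]  # restore the original column order
```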
Another solution:
print (df)
Country Region City Product Sales Val
0 Europe Italy Milan Ring 100 10
1 Europe Italy Milan Book 300 5
2 Europe Italy Milan Phone 1500 10
3 Europe Italy Milan Car 200 5
4 Europe Italy Milan Ring 100 10
5 Europe Italy Rome Pen 8200 5
6 Europe Italy Rome Ring 100 10
7 Europe Italy Rome Book 300 5
8 Europe Italy Rome Phone 1500 10
9 Europe Italy Rome Car 200 5
10 Europe Italy Rome Ring 100 10
11 Europe Italy Rome Pencil 8100 5
The idea is to sort the values by Sales, create a per-group counter column with cumcount, and replace the Product value with Other for every row beyond the top 3:
import numpy as np

cols = ['Country','Region', 'City']
df['g'] = df.sort_values('Sales', ascending=False).groupby(cols).cumcount()
df['Product'] = np.where(df['g'] >= 3, 'Other', df['Product'])
print (df)
Country Region City Product Sales Val g
0 Europe Italy Milan Other 100 10 3
1 Europe Italy Milan Book 300 5 1
2 Europe Italy Milan Phone 1500 10 0
3 Europe Italy Milan Car 200 5 2
4 Europe Italy Milan Other 100 10 4
5 Europe Italy Rome Pen 8200 5 0
6 Europe Italy Rome Other 100 10 5
7 Europe Italy Rome Other 300 5 3
8 Europe Italy Rome Phone 1500 10 2
9 Europe Italy Rome Other 200 5 4
10 Europe Italy Rome Other 100 10 6
11 Europe Italy Rome Pencil 8100 5 1
Then aggregate with sum, sort by the counter column so 'Other' lands last in each group, and drop it:
df2 = (df.groupby(cols + ['Product'], as_index=False).sum()
.sort_values(cols + ['g'])
.drop('g', axis=1)
.reset_index(drop=True))
print (df2)
Country Region City Product Sales Val
0 Europe Italy Milan Phone 1500 10
1 Europe Italy Milan Book 300 5
2 Europe Italy Milan Car 200 5
3 Europe Italy Milan Other 200 20
4 Europe Italy Rome Pen 8200 5
5 Europe Italy Rome Pencil 8100 5
6 Europe Italy Rome Phone 1500 10
7 Europe Italy Rome Other 700 30
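The whole second solution as a runnable sketch, reconstructing the Milan + Rome frame shown above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['Europe'] * 12,
    'Region': ['Italy'] * 12,
    'City': ['Milan'] * 5 + ['Rome'] * 7,
    'Product': ['Ring', 'Book', 'Phone', 'Car', 'Ring',
                'Pen', 'Ring', 'Book', 'Phone', 'Car', 'Ring', 'Pencil'],
    'Sales': [100, 300, 1500, 200, 100,
              8200, 100, 300, 1500, 200, 100, 8100],
    'Val': [10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5],
})

cols = ['Country', 'Region', 'City']

# rank each row within its group by descending Sales
df['g'] = df.sort_values('Sales', ascending=False).groupby(cols).cumcount()

# everything beyond the top 3 becomes 'Other'
df['Product'] = np.where(df['g'] >= 3, 'Other', df['Product'])

# aggregate, order 'Other' last via the summed counter, then drop it
out = (df.groupby(cols + ['Product'], as_index=False)[['Sales', 'Val', 'g']].sum()
         .sort_values(cols + ['g'])
         .drop('g', axis=1)
         .reset_index(drop=True))
```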