I have a data frame with Country, Region, City, Product, Sales ($) and Units. For each Country, Region, City combination I need the top 3 products by sales, with the remaining products rolled up into an "Other" row carrying their summed sales and units.
The end result is the top 3 products + "Other" for each combination of Country, Region, City:
Country Region City Product Sales Val
Europe Italy Milan Ring 100 10
Europe Italy Milan Book 300 5
Europe Italy Milan Phone 1500 10
Europe Italy Milan Car 200 5
Europe Italy Milan Ring 100 10
Europe Italy Milan Pen 8200 5
Results for top 3:
Country Region City Product Sales Val
Europe Italy Milan Pen 8200 5
Europe Italy Milan Phone 1500 10
Europe Italy Milan Book 300 5
Europe Italy Milan Other 400 25
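For reference, the sample frame above can be reconstructed like this (column names taken from the question):

```python
import pandas as pd

# Milan sample rows from the question
df = pd.DataFrame({
    'Country': ['Europe'] * 6,
    'Region': ['Italy'] * 6,
    'City': ['Milan'] * 6,
    'Product': ['Ring', 'Book', 'Phone', 'Car', 'Ring', 'Pen'],
    'Sales': [100, 300, 1500, 200, 100, 8200],
    'Val': [10, 5, 10, 5, 10, 5],
})
```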
First, create a default index with reset_index:
df = df.reset_index(drop=True)
Then sort by the Sales column with sort_values and take the top 3 rows per group with GroupBy.head:
cols = ['Country','Region', 'City']
df1 = df.sort_values('Sales', ascending=False).groupby(cols).head(3)
print (df1)
Country Region City Product Sales Val
5 Europe Italy Milan Pen 8200 5
2 Europe Italy Milan Phone 1500 10
1 Europe Italy Milan Book 300 5
Then filter out the rows already used for the top 3 (via Index.difference) and aggregate the rest with sum:
df2 = df.loc[df.index.difference(df1.index)]
df2 = df2.groupby(cols, as_index=False).sum().assign(Product='Other')
print (df2)
Country Region City Sales Val Product
0 Europe Italy Milan 400 25 Other
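In case Index.difference is unfamiliar, a minimal sketch of what it does (toy indexes, not the question's data):

```python
import pandas as pd

a = pd.Index([0, 1, 2, 3, 4, 5])  # all row labels
b = pd.Index([1, 2, 5])           # labels kept as "top 3"

# Index.difference returns the sorted labels of `a` not present in `b`
remaining = a.difference(b)
print(list(remaining))  # → [0, 3, 4]
```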
Last, join both parts together with concat:
df = pd.concat([df1, df2]).sort_values(cols).reset_index(drop=True)
print (df)
City Country Product Region Sales Val
0 Milan Europe Pen Italy 8200 5
1 Milan Europe Phone Italy 1500 10
2 Milan Europe Book Italy 300 5
3 Milan Europe Other Italy 400 25
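Putting the first solution together as a runnable sketch (using the Milan sample from the question; the final column reindex is an addition to keep the original column order, since older pandas versions of concat may alphabetize it as in the output above):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['Europe'] * 6,
    'Region': ['Italy'] * 6,
    'City': ['Milan'] * 6,
    'Product': ['Ring', 'Book', 'Phone', 'Car', 'Ring', 'Pen'],
    'Sales': [100, 300, 1500, 200, 100, 8200],
    'Val': [10, 5, 10, 5, 10, 5],
})

cols = ['Country', 'Region', 'City']

# top 3 rows per group by descending Sales
df1 = df.sort_values('Sales', ascending=False).groupby(cols).head(3)

# remaining rows, summed into one 'Other' row per group
# (explicit numeric column selection avoids summing the Product strings)
df2 = df.loc[df.index.difference(df1.index)]
df2 = (df2.groupby(cols, as_index=False)[['Sales', 'Val']].sum()
          .assign(Product='Other'))

out = pd.concat([df1, df2]).sort_values(cols).reset_index(drop=True)
out = out[df.columns]  # restore the original column order
```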
Another solution:
print (df)
Country Region City Product Sales Val
0 Europe Italy Milan Ring 100 10
1 Europe Italy Milan Book 300 5
2 Europe Italy Milan Phone 1500 10
3 Europe Italy Milan Car 200 5
4 Europe Italy Milan Ring 100 10
5 Europe Italy Rome Pen 8200 5
6 Europe Italy Rome Ring 100 10
7 Europe Italy Rome Book 300 5
8 Europe Italy Rome Phone 1500 10
9 Europe Italy Rome Car 200 5
10 Europe Italy Rome Ring 100 10
11 Europe Italy Rome Pencil 8100 5
The idea is to sort the values by Sales, create a per-group counter column with cumcount, and replace the Product value with Other for every row beyond the top 3:
import numpy as np

cols = ['Country','Region', 'City']
df['g'] = df.sort_values('Sales', ascending=False).groupby(cols).cumcount()
df['Product'] = np.where(df['g'] >= 3, 'Other', df['Product'])
print (df)
Country Region City Product Sales Val g
0 Europe Italy Milan Other 100 10 3
1 Europe Italy Milan Book 300 5 1
2 Europe Italy Milan Phone 1500 10 0
3 Europe Italy Milan Car 200 5 2
4 Europe Italy Milan Other 100 10 4
5 Europe Italy Rome Pen 8200 5 0
6 Europe Italy Rome Other 100 10 5
7 Europe Italy Rome Other 300 5 3
8 Europe Italy Rome Phone 1500 10 2
9 Europe Italy Rome Other 200 5 4
10 Europe Italy Rome Other 100 10 6
11 Europe Italy Rome Pencil 8100 5 1
Then aggregate with sum, sort by the counter column so 'Other' lands last in each group, and drop it:
df2 = (df.groupby(cols + ['Product'], as_index=False).sum()
.sort_values(cols + ['g'])
.drop('g', axis=1)
.reset_index(drop=True))
print (df2)
Country Region City Product Sales Val
0 Europe Italy Milan Phone 1500 10
1 Europe Italy Milan Book 300 5
2 Europe Italy Milan Car 200 5
3 Europe Italy Milan Other 200 20
4 Europe Italy Rome Pen 8200 5
5 Europe Italy Rome Pencil 8100 5
6 Europe Italy Rome Phone 1500 10
7 Europe Italy Rome Other 700 30
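The whole second solution as a runnable sketch, reconstructing the Milan + Rome frame shown above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Country': ['Europe'] * 12,
    'Region': ['Italy'] * 12,
    'City': ['Milan'] * 5 + ['Rome'] * 7,
    'Product': ['Ring', 'Book', 'Phone', 'Car', 'Ring',
                'Pen', 'Ring', 'Book', 'Phone', 'Car', 'Ring', 'Pencil'],
    'Sales': [100, 300, 1500, 200, 100,
              8200, 100, 300, 1500, 200, 100, 8100],
    'Val': [10, 5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 5],
})

cols = ['Country', 'Region', 'City']

# rank each row within its group by descending Sales
df['g'] = df.sort_values('Sales', ascending=False).groupby(cols).cumcount()

# everything beyond the top 3 becomes 'Other'
df['Product'] = np.where(df['g'] >= 3, 'Other', df['Product'])

# aggregate, order 'Other' last via the summed counter, then drop it
out = (df.groupby(cols + ['Product'], as_index=False)[['Sales', 'Val', 'g']].sum()
         .sort_values(cols + ['g'])
         .drop('g', axis=1)
         .reset_index(drop=True))
```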