Reputation: 2920
I have pandas dataframe in the following format:
df = pd.DataFrame(data={'item_code': ['Item1', 'Item2', 'Item3', 'Item1', 'Item1', 'Item2', 'Item2', 'Item1', 'Item3'],
'dollar_amount': ['200.25', '350.00', '120.00', '400.50', '1001.25', '700.00', '350.00', '200.25', '240.00'],
'supplier_code': ['Sup1','Sup1','Sup2','Sup1','Sup1','Sup1','Sup1','Sup2','Sup2']})
This is how this dataframe looks:
dollar_amount item_code supplier_code
0 200.25 Item1 Sup1
1 350.00 Item2 Sup1
2 120.00 Item3 Sup2
3 400.50 Item1 Sup1
4 1001.25 Item1 Sup1
5 700.00 Item2 Sup1
6 350.00 Item2 Sup1
7 200.25 Item1 Sup2
8 240.00 Item3 Sup2
I can get the top N Items summed up by dollar_amount
with this:
a = data.groupby('item_code', as_index=False).sum()
cnt_srs = a[['item_code','dollar_amount']].sort_values(by=['dollar_amount'], ascending=False).head(15)
cnt_srs
Sample output top-5 items (not of the above data):
item_code dollar_amount
173 Item1 8,776,906.5400
442 Item2 2,085,528.8170
367 Item3 2,033,746.0500
25 Item4 1,635,830.4040
537 Item5 1,485,672.4050
Now I want the supplier_code
s for these items.
I can get the supplier_code and its most frequently supplied item using this:
N = 1
df1 = data.groupby(['supplier_code'])['item_code'].value_counts().groupby('supplier_code').head(N)
df1
Sample Output:
supplier_code item_code
Sup1 Item4 9
Sup2 Item2 21
Sup3 Item1 7
Sup4 Item5 173
Sup5 Item3 1
Sup6 Item6 12
I want to get the top N Item and their Supplier where top N items are determined by the sum of their dollar amounts.
Expected Output:
item_code dollar_amount SupplierCode
TopItem1 8,776,906.5400 Sup4
TopItem2 2,085,528.8170 Sup1
TopItem3 2,033,746.0500 Sup3
TopItem4 1,635,830.4040 Sup1
TopItem5 1,485,672.4050 Sup2
I want the top N Items(In terms of dollar_amount) and the corresponding Supplier_codes
Thank you so much for all the help up till now.
Upvotes: 0
Views: 420
Reputation: 862491
You can use:
df1 = (df.groupby('item_code', as_index=False)
.agg({'dollar_amount':'sum', 'supplier_code': 'first'}))
print (df1)
item_code supplier_code dollar_amount
0 Item1 Sup1 1802.25
1 Item2 Sup1 1400.00
2 Item3 Sup2 360.00
Upvotes: 1
Reputation: 1151
Using your sample data and if you only want the first supplier_code, then below will suffice.
>>> print(df.groupby('item_code').agg({'dollar_amount':sum, 'supplier_code': lambda curr: curr.iloc[0]}).sort_values('dollar_amount', ascending=False).reset_index())
item_code dollar_amount supplier_code
0 Item1 1802.25 Sup1
1 Item2 1400.00 Sup1
2 Item3 360.00 Sup2
reset_index() is optional. I just used it to return a DataFrame.
But if you need to get the top supplier_code per item_code then I think you need to do a dual group_by.
>>> print(df.groupby(['item_code', 'supplier_code']).sum().sort_values('dollar_amount', ascending=False).reset_index().groupby('item_code').agg({'dollar_amount':sum, 'supplier_code': lambda curr: curr.iloc[0]}).reset_index())
item_code dollar_amount supplier_code
0 Item1 1802.25 Sup1
1 Item2 1400.00 Sup1
2 Item3 360.00 Sup2
Upvotes: 0