Reputation: 91
I am working with the following pandas dataframe (from csv file):
----------------------------------------------
buyer_id itemX itemY bid1 bid2
----------------------------------------------
buyer1 item1 item2 52 32
buyer2 item1 item2 15 17
.. ..
buyer500 item1 item2 82 13
I have 500 buyers bidding for item1 and item2, the following 200 buyers bid for item3 and item4
buyer600 item3 item4 63 82
..
..
buyer800 item3 item4 40 12
I sum Bid1 and bid2 so I can sort to find max bids, Now I want to find and keep the max 5 bids only for each group.. group1(item 1 + item2) and group2(item3 + item4). How can I do it?
Here is my code so far,
import pandas as pd
df = pd.read)csv('myfile.csv')
sum = (df[df.columns[4:]].sum(1))
df['sum'] = calc
df.sort_values(['sort of sum'], ascending=[False], inplace=True)
df.to_csv('results.csv')
output should be something like
-----------------------------------------
buyer itemX itemY sum of bid
-----------------------------------------
buyer13 item1 item2 350
buyer2 item1 item2 283
buyer65 item1 item2 236
buyer602 item3 item4 80
buyer703 item3 item4 76
buyer640 item3 item4 69
Upvotes: 0
Views: 5828
Reputation: 292
One way to do this would be to create two dataframes, one for the item1 & 2 buyers, and another for the item3 & 4 buyers
df1 = df[df['itemX'] == 'item1']
df2 = df[df['itemX'] == 'item3']
Then you can create a new column which sums the bids
df1['sum_bids'] = df1['bid1'] + df1['bid2']
Then sort the dataframes using sort_values
sorted_df1 = df1.sort_values(['sum_bids'], ascending=False)
Then you can re-index them
sorted_df1.index = range(1,len(sorted_df1) + 1)
Then select first 5 rows
max_bids = sorted_df1[:5]
Upvotes: 2
Reputation: 19947
#create a total column, sort by item name and total bid, then take the top 5 for each group.
df.assign(total = df.bid1+df.bid2).sort_values(['itemX','total'],ascending=False).groupby('itemX').head(5)
Out[2566]:
buyer_id itemX itemY bid1 bid2 total
12 buyer61 item3 item4 44 60 104
11 buyer51 item3 item4 52 32 84
8 buyer21 item3 item4 15 55 70
10 buyer41 item3 item4 40 17 57
9 buyer31 item3 item4 52 3 55
6 buyer500 item1 item2 82 13 95
0 buyer1 item1 item2 52 32 84
2 buyer3 item1 item2 52 32 84
4 buyer5 item1 item2 52 32 84
5 buyer6 item1 item2 15 60 75
Upvotes: 1
Reputation: 198
First I'm going to make a sample dataframe that I think is similar to the one you describe.
bids_df = pd.DataFrame({'buyer_id': ['buyer' + str(i) for i in range(1, 501)] + ['buyer' + str(i) for i in range(600, 800)],
'itemX': ['item1'] * 500 + ['item3'] * 200,
'itemY': ['item2'] * 500 + ['item4'] * 200,
'bid1': [randint(10, 100) for _ in range(700)],
'bid2': [randint(10, 100) for _ in range(700)]})
bids_df = bids_df[['buyer_id', 'itemX', 'itemY', 'bid1', 'bid2']]
Then we can add a column for the sum of bids and sort: you've done this much in your question already.
bids_df['bid_sum'] = bids_df['bid1'] + bids_df['bid2']
bids_df = bids_df.sort_values(by=['bid_sum'], ascending=False)
Finally we can group the dataframe by item as you describe, and simply get the top 5 rows from each group (since we've already sorted them by bid sum).
bids_df.groupby(lambda x: 'grp1' if bids_df.loc[x, 'itemX'] == 'item1' else 'grp2')\
.head(5)\
.sort_values(by=['itemX', 'bid_sum'], ascending=[True, False])
This gives the resulting dataframe:
buyer_id itemX itemY bid1 bid2 bid_sum
60 buyer61 item1 item2 99 97 196
498 buyer499 item1 item2 98 97 195
470 buyer471 item1 item2 92 99 191
120 buyer121 item1 item2 98 93 191
50 buyer51 item1 item2 100 91 191
573 buyer673 item3 item4 100 94 194
639 buyer739 item3 item4 90 95 185
512 buyer612 item3 item4 89 94 183
691 buyer791 item3 item4 100 78 178
659 buyer759 item3 item4 87 91 178
Upvotes: 1