Reputation: 329
Please see pandas df:
pd.DataFrame({'id': [1, 1, 2, 2, 2, 3],
'pay_date': ['Jul1', 'Jul2', 'Jul8', 'Aug5', 'Aug7', 'Aug22'],
'id_ind': [1, 2, 1, 2, 3, 1]})
I am trying to groupby 'id' and 'pay_date'. I only want to keep df['id_ind'].nlargest(2) in the dataframe after grouping by 'id' and 'pay_date'. Here is my code:
df = pd.DataFrame(df.groupby(['id', 'pay_date'])['id_ind'].apply(
lambda x: x.nlargest(2)).reset_index()
This does not work, as the new df returns all the records. If it worked, 'id'==2 would only appear twice in the df, as there are 3 records and I only want the 2 largest by 'id_ind'.
My desired output:
pd.DataFrame({'id': [1, 1, 2, 2, 3],
'pay_date': ['Jul1', 'Jul2', 'Aug5', 'Aug7', 'Aug22'],
'id_ind': [1, 2, 2, 3, 1]})
Upvotes: 1
Views: 492
Reputation: 25239
Sort on id_ind
and doing groupby.tail
df_final = (df.sort_values('id_ind').groupby('id').tail(2)
.sort_index()
.reset_index(drop=True))
Out[29]:
id id_ind pay_date
0 1 1 Jul1
1 1 2 Jul2
2 2 2 Aug5
3 2 3 Aug7
4 3 1 Aug22
Upvotes: 2