Conditionally keep only one of the duplicates in pandas groupby groups

Question

I have a dataset in this format (it can be downloaded in CSV format from here):

ID  DateAcquired    DateSent
1   20210518        20220110
1   20210719        20220210
1   20210719        20220310
1   20200420        20220410
1   20210328        20220510
1   20210518        20220610
2   20210108        20220110
2   20210110        20220210
2   20210119        20220310
2   20210108        20220410
2   20200109        20220510
2   20210919        20220610
2   20211214        20220612
2   20210812        20220620
2   20210909        20220630
2   20200102        20220811
2   20200608        20220909
2   20210506        20221005
2   20210130        20221101
3   20210518        20220110
3   20210519        20220210
3   20210520        20220310
3   20210518        20220410
3   20210611        20220510
3   20210521        20220610
3   20210723        20220612
3   20211211        20220620
4   20210518        20220110
4   20210519        20220210
4   20210520        20220310
4   20210618        20220410
4   20210718        20220510
4   20210818        20220610
5   20210518        20220110
5   20210818        20220210
5   20210918        20220310
5   20211018        20220410
5   20211113        20220510
5   20211218        20220610
5   20210631        20221212
6T  20200102        20201101
6T  20200102        20201101
6T  20200102        20201101
6T  20210405        20220610
6T  20210606        20220611

I am doing groupby:

data.groupby(['ID','DateAcquired'])

For each unique combination of ID and DateAcquired, I only want to keep one DateSent: the newest one. In other words, if a unique combination of ID and DateAcquired has two or more DateSent values, keep only the row where DateSent is the largest/newest. This operation should apply only if ID is NOT 6T.

I am out of ideas on how to do this. Is there an easy way of doing it with pandas?

jezrael · Accepted Answer

You can filter the rows where ID is not 6T, select the row with the maximum DateSent in each group using DataFrameGroupBy.idxmax, and then append the 6T rows to the output:

import pandas as pd

m = df['ID'].ne('6T')
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df = pd.concat([df.loc[df[m].groupby(['ID','DateAcquired'])['DateSent'].idxmax()],
                df[~m]], ignore_index=True)
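To see the idxmax approach end to end, here is a minimal sketch on a hand-picked subset of the question's rows. The small frame and the names keep and result are only for illustration; ID is assumed to be stored as strings, which is what reading the CSV would give because of the 6T value.

```python
import pandas as pd

# Illustrative subset of the question's data (IDs as strings because of "6T")
df = pd.DataFrame({
    "ID": ["1", "1", "1", "1", "6T", "6T", "6T"],
    "DateAcquired": [20210518, 20210719, 20210719, 20210518,
                     20200102, 20200102, 20210405],
    "DateSent": [20220110, 20220210, 20220310, 20220610,
                 20201101, 20201101, 20220610],
})

# True for every row that should be deduplicated (everything except 6T)
m = df["ID"].ne("6T")

# Index label of the row with the largest DateSent in each (ID, DateAcquired) group
keep = df[m].groupby(["ID", "DateAcquired"])["DateSent"].idxmax()

# Selected rows first, then the untouched 6T rows
result = pd.concat([df.loc[keep], df[~m]], ignore_index=True)
print(result)
```

ID 1 is reduced to one row per DateAcquired, while all three 6T rows survive, including the exact duplicates.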

Solution with sorting and removing duplicates:

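import pandas as pd

m = df['ID'].ne('6T')
# sort the newest DateSent first within each group, then keep the first row per group
df = pd.concat([df[m].sort_values(['ID','DateAcquired','DateSent'], ascending=[True, True, False])
                     .drop_duplicates(subset=['ID','DateAcquired']),
                df[~m]], ignore_index=True)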
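The sorting variant can also be wrapped in a small function, sketched below under a few assumptions: keep_newest and its exempt_id parameter are hypothetical names that are not part of pandas or the original answer, and the sample frame is a subset of the question's rows for ID 2 and 6T.

```python
import pandas as pd

def keep_newest(df, exempt_id="6T"):
    """Hypothetical helper: keep the newest DateSent per (ID, DateAcquired),
    leaving rows whose ID equals exempt_id untouched."""
    m = df["ID"].ne(exempt_id)
    deduped = (
        df[m]
        # Descending DateSent puts the newest row first in each group
        .sort_values(["ID", "DateAcquired", "DateSent"],
                     ascending=[True, True, False])
        # drop_duplicates keeps the first occurrence by default
        .drop_duplicates(subset=["ID", "DateAcquired"])
    )
    return pd.concat([deduped, df[~m]], ignore_index=True)

# Illustrative subset of the question's data
df = pd.DataFrame({
    "ID": ["2", "2", "2", "6T", "6T"],
    "DateAcquired": [20210108, 20210110, 20210108, 20200102, 20200102],
    "DateSent": [20220110, 20220210, 20220410, 20201101, 20201101],
})

out = keep_newest(df)
print(out)
```

For the pair (2, 20210108) only the row with DateSent 20220410 remains, and both duplicate 6T rows are kept.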
