paropunam
paropunam

Reputation: 488

Conditionally keep only one of the duplicates in pandas groupby groups

I have a dataset in this format: (can be download in CSV format from here)

ID  DateAcquired    DateSent
1   20210518        20220110
1   20210719        20220210
1   20210719        20220310
1   20200420        20220410
1   20210328        20220510
1   20210518        20220610
2   20210108        20220110
2   20210110        20220210
2   20210119        20220310
2   20210108        20220410
2   20200109        20220510
2   20210919        20220610
2   20211214        20220612
2   20210812        20220620
2   20210909        20220630
2   20200102        20220811
2   20200608        20220909
2   20210506        20221005
2   20210130        20221101
3   20210518        20220110
3   20210519        20220210
3   20210520        20220310
3   20210518        20220410
3   20210611        20220510
3   20210521        20220610
3   20210723        20220612
3   20211211        20220620
4   20210518        20220110
4   20210519        20220210
4   20210520        20220310
4   20210618        20220410
4   20210718        20220510
4   20210818        20220610
5   20210518        20220110
5   20210818        20220210
5   20210918        20220310
5   20211018        20220410
5   20211113        20220510
5   20211218        20220610
5   20210631        20221212
6T  20200102        20201101
6T  20200102        20201101
6T  20200102        20201101
6T  20210405        20220610
6T  20210606        20220611

I am doing groupby:

data.groupby(['ID','DateAcquired'])

For each unique combination of ID and DateAcquired, I am only interested in keeping one DateSent, and that is the newest one. Therefore, in other words, if a unique combination of ID and DateAcquired has two DateSent available, only take the one where DateSent is the largest/newest. This operation should apply only if ID is NOT 6T.

I am out of ideas on how to do this. Is there an easy way of doing it with pandas?

Upvotes: 1

Views: 271

Answers (2)

jezrael
jezrael

Reputation: 862661

You can filter rows for not equal 6T and get maximum rows by DateSent by DataFrameGroupBy.idxmax and then append 6T rows to output:

m = df['ID'].ne('6T')
df = (df.loc[df[m].groupby(['ID','DateAcquired'])['DateSent'].idxmax()]
        .append(df[~m], ignore_index=True))

Solution with sorting and removing duplicates:

m = df['ID'].ne('6T')
df = (df[m].sort_values(['ID','DateAcquired','DateSent'], ascending=[True, True, False])
           .drop_duplicates(subset=['ID','DateAcquired'])
           .append(df[~m], ignore_index=True))

Upvotes: 2

Mayank Porwal
Mayank Porwal

Reputation: 34056

Use pd.to_datetime with Groupby.max:

In [835]: df.DateSent = pd.to_datetime(df.DateSent, format='%Y%m%d')
In [841]: df[df.ID.ne('6T')].groupby(['ID','DateAcquired'])['DateSent'].max().reset_index().append(df[df.ID.eq('6T')])
Out[841]: 
    ID  DateAcquired   DateSent
0    1      20200420 2022-04-10
1    1      20210328 2022-05-10
2    1      20210518 2022-06-10
3    1      20210719 2022-03-10
4    2      20200102 2022-08-11
5    2      20200109 2022-05-10
6    2      20200608 2022-09-09
7    2      20210108 2022-04-10
8    2      20210110 2022-02-10
9    2      20210119 2022-03-10
10   2      20210130 2022-11-01
11   2      20210506 2022-10-05
12   2      20210812 2022-06-20
13   2      20210909 2022-06-30
14   2      20210919 2022-06-10
15   2      20211214 2022-06-12
16   3      20210518 2022-04-10
17   3      20210519 2022-02-10
18   3      20210520 2022-03-10
19   3      20210521 2022-06-10
20   3      20210611 2022-05-10
21   3      20210723 2022-06-12
22   3      20211211 2022-06-20
23   4      20210518 2022-01-10
24   4      20210519 2022-02-10
25   4      20210520 2022-03-10
26   4      20210618 2022-04-10
27   4      20210718 2022-05-10
28   4      20210818 2022-06-10
29   5      20210518 2022-01-10
30   5      20210631 2022-12-12
31   5      20210818 2022-02-10
32   5      20210918 2022-03-10
33   5      20211018 2022-04-10
34   5      20211113 2022-05-10
35   5      20211218 2022-06-10
40  6T      20200102 2020-11-01
41  6T      20200102 2020-11-01
42  6T      20200102 2020-11-01
43  6T      20210405 2022-06-10
44  6T      20210606 2022-06-11

Upvotes: 1

Related Questions