Reputation: 488
I have a dataset in this format: (can be download in CSV format from here)
ID DateAcquired DateSent
1 20210518 20220110
1 20210719 20220210
1 20210719 20220310
1 20200420 20220410
1 20210328 20220510
1 20210518 20220610
2 20210108 20220110
2 20210110 20220210
2 20210119 20220310
2 20210108 20220410
2 20200109 20220510
2 20210919 20220610
2 20211214 20220612
2 20210812 20220620
2 20210909 20220630
2 20200102 20220811
2 20200608 20220909
2 20210506 20221005
2 20210130 20221101
3 20210518 20220110
3 20210519 20220210
3 20210520 20220310
3 20210518 20220410
3 20210611 20220510
3 20210521 20220610
3 20210723 20220612
3 20211211 20220620
4 20210518 20220110
4 20210519 20220210
4 20210520 20220310
4 20210618 20220410
4 20210718 20220510
4 20210818 20220610
5 20210518 20220110
5 20210818 20220210
5 20210918 20220310
5 20211018 20220410
5 20211113 20220510
5 20211218 20220610
5 20210631 20221212
6T 20200102 20201101
6T 20200102 20201101
6T 20200102 20201101
6T 20210405 20220610
6T 20210606 20220611
I am doing groupby
:
data.groupby(['ID','DateAcquired'])
For each unique combination of ID
and DateAcquired
, I am only interested in keeping one DateSent
, and that is the newest one. Therefore, in other words, if a unique combination of ID
and DateAcquired
has two DateSent
available, only take the one where DateSent
is the largest/newest. This operation should apply only if ID
is NOT 6T
.
I am out of ideas on how to do this. Is there an easy way of doing it with pandas?
Upvotes: 1
Views: 271
Reputation: 862661
You can filter rows for not equal 6T
and get maximum rows by DateSent
by DataFrameGroupBy.idxmax
and then append 6T
rows to output:
m = df['ID'].ne('6T')
df = (df.loc[df[m].groupby(['ID','DateAcquired'])['DateSent'].idxmax()]
.append(df[~m], ignore_index=True))
Solution with sorting and removing duplicates:
m = df['ID'].ne('6T')
df = (df[m].sort_values(['ID','DateAcquired','DateSent'], ascending=[True, True, False])
.drop_duplicates(subset=['ID','DateAcquired'])
.append(df[~m], ignore_index=True))
Upvotes: 2
Reputation: 34056
Use pd.to_datetime
with Groupby.max
:
In [835]: df.DateSent = pd.to_datetime(df.DateSent, format='%Y%m%d')
In [841]: df[df.ID.ne('6T')].groupby(['ID','DateAcquired'])['DateSent'].max().reset_index().append(df[df.ID.eq('6T')])
Out[841]:
ID DateAcquired DateSent
0 1 20200420 2022-04-10
1 1 20210328 2022-05-10
2 1 20210518 2022-06-10
3 1 20210719 2022-03-10
4 2 20200102 2022-08-11
5 2 20200109 2022-05-10
6 2 20200608 2022-09-09
7 2 20210108 2022-04-10
8 2 20210110 2022-02-10
9 2 20210119 2022-03-10
10 2 20210130 2022-11-01
11 2 20210506 2022-10-05
12 2 20210812 2022-06-20
13 2 20210909 2022-06-30
14 2 20210919 2022-06-10
15 2 20211214 2022-06-12
16 3 20210518 2022-04-10
17 3 20210519 2022-02-10
18 3 20210520 2022-03-10
19 3 20210521 2022-06-10
20 3 20210611 2022-05-10
21 3 20210723 2022-06-12
22 3 20211211 2022-06-20
23 4 20210518 2022-01-10
24 4 20210519 2022-02-10
25 4 20210520 2022-03-10
26 4 20210618 2022-04-10
27 4 20210718 2022-05-10
28 4 20210818 2022-06-10
29 5 20210518 2022-01-10
30 5 20210631 2022-12-12
31 5 20210818 2022-02-10
32 5 20210918 2022-03-10
33 5 20211018 2022-04-10
34 5 20211113 2022-05-10
35 5 20211218 2022-06-10
40 6T 20200102 2020-11-01
41 6T 20200102 2020-11-01
42 6T 20200102 2020-11-01
43 6T 20210405 2022-06-10
44 6T 20210606 2022-06-11
Upvotes: 1