Reputation: 2680
I have a dataframe with Groups, two dates, and a value.
I want a subset of the dataframe that keeps only one row per unique B_DATE value within each GRP. Where there are duplicate B_DATE values within a group, I want to keep the row with the maximum A_DATE value.
So, if my initial dataframe were:
GRP A_DATE B_DATE VALUE
A 12/31/2012 2/19/2014 546.2
A 12/31/2013 2/19/2014 543.7
A 3/31/2013 4/30/2014 473.3
A 3/31/2014 4/30/2014 472.5
A 6/30/2013 7/30/2014 528.7
A 6/30/2014 7/30/2014 531.5
A 9/30/2013 10/30/2014 529
A 9/30/2014 10/30/2014 546.7
A 12/31/2014 2/18/2015 573.5
A 3/31/2015 4/30/2015 458.7
A 6/30/2015 7/30/2015 519.5
B 3/31/2014 7/7/2015 1329
B 12/31/2014 7/7/2015 1683
B 3/31/2015 7/7/2015 1361
B 6/30/2014 8/13/2015 1452
B 6/30/2015 8/13/2015 1429
B 9/30/2014 10/29/2015 1488
B 9/30/2015 10/29/2015 1595
B 12/31/2015 2/16/2016 1763
B 3/31/2016 4/28/2016 1548
I would want the result to look like this:
GRP A_DATE B_DATE VALUE
A 12/31/2013 2/19/2014 543.7
A 3/31/2014 4/30/2014 472.5
A 6/30/2014 7/30/2014 531.5
A 9/30/2014 10/30/2014 546.7
A 12/31/2014 2/18/2015 573.5
A 3/31/2015 4/30/2015 458.7
A 6/30/2015 7/30/2015 519.5
B 3/31/2015 7/7/2015 1361
B 6/30/2015 8/13/2015 1429
B 9/30/2015 10/29/2015 1595
B 12/31/2015 2/16/2016 1763
B 3/31/2016 4/28/2016 1548
I know how to do this with cumbersome looping and argmax(), but I'm wondering if there is a clean, efficient, Pythonic approach.
Thanks in advance.
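For reference, the loop-based approach I have now looks roughly like this (just a sketch on a few of the rows above; note that Series.argmax() for index labels is deprecated in newer pandas, so idxmax() is used here):

```python
import pandas as pd

df = pd.DataFrame({
    'GRP':    ['A', 'A', 'A', 'A'],
    'A_DATE': ['12/31/2012', '12/31/2013', '3/31/2013', '3/31/2014'],
    'B_DATE': ['2/19/2014', '2/19/2014', '4/30/2014', '4/30/2014'],
    'VALUE':  [546.2, 543.7, 473.3, 472.5],
})
df['A_DATE'] = pd.to_datetime(df['A_DATE'])

# Loop over each (GRP, B_DATE) pair and keep the row where A_DATE is maximal.
keep = []
for _, sub in df.groupby(['GRP', 'B_DATE']):
    keep.append(sub['A_DATE'].idxmax())
result = df.loc[sorted(keep)]
```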
Upvotes: 0
Views: 51
Reputation: 153500
Let's use sort_values and drop_duplicates:
df.sort_values(['GRP','A_DATE'], ascending=[True,False])\
.drop_duplicates(subset=['GRP','B_DATE'])
Output:
GRP A_DATE B_DATE VALUE
7 A 9/30/2014 10/30/2014 546.7
10 A 6/30/2015 7/30/2015 519.5
5 A 6/30/2014 7/30/2014 531.5
9 A 3/31/2015 4/30/2015 458.7
3 A 3/31/2014 4/30/2014 472.5
8 A 12/31/2014 2/18/2015 573.5
1 A 12/31/2013 2/19/2014 543.7
17 B 9/30/2015 10/29/2015 1595.0
15 B 6/30/2015 8/13/2015 1429.0
19 B 3/31/2016 4/28/2016 1548.0
13 B 3/31/2015 7/7/2015 1361.0
18 B 12/31/2015 2/16/2016 1763.0
Then add sort_index to restore the original row order:
df.sort_values(['GRP','A_DATE'], ascending=[True,False])\
.drop_duplicates(subset=['GRP','B_DATE']).sort_index()
GRP A_DATE B_DATE VALUE
1 A 12/31/2013 2/19/2014 543.7
3 A 3/31/2014 4/30/2014 472.5
5 A 6/30/2014 7/30/2014 531.5
7 A 9/30/2014 10/30/2014 546.7
8 A 12/31/2014 2/18/2015 573.5
9 A 3/31/2015 4/30/2015 458.7
10 A 6/30/2015 7/30/2015 519.5
13 B 3/31/2015 7/7/2015 1361.0
15 B 6/30/2015 8/13/2015 1429.0
17 B 9/30/2015 10/29/2015 1595.0
18 B 12/31/2015 2/16/2016 1763.0
19 B 3/31/2016 4/28/2016 1548.0
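One caveat worth noting (my addition, not part of the answer above): if A_DATE is stored as strings, sort_values orders them lexicographically, which can differ from chronological order for M/D/Y dates (it happens to coincide on the sample data). Converting with pd.to_datetime first makes the sort robust. A minimal sketch with two hypothetical rows where the two orderings disagree:

```python
import pandas as pd

df = pd.DataFrame({
    'GRP':    ['A', 'A'],
    'A_DATE': ['9/30/2013', '12/31/2014'],  # as strings, '9/...' sorts above '12/...'
    'B_DATE': ['2/19/2014', '2/19/2014'],
    'VALUE':  [1.0, 2.0],
})

# Lexicographic sort would keep the 9/30/2013 row; datetime sort keeps the later date.
df['A_DATE'] = pd.to_datetime(df['A_DATE'])
out = (df.sort_values(['GRP', 'A_DATE'], ascending=[True, False])
         .drop_duplicates(subset=['GRP', 'B_DATE'])
         .sort_index())
```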
Upvotes: 3
Reputation: 30605
I think you want to groupby 'GRP' and 'B_DATE' and aggregate the last value, i.e.
df['A_DATE'] = pd.to_datetime(df['A_DATE'])
df['B_DATE'] = pd.to_datetime(df['B_DATE'])
ndf = df.groupby(['GRP',df['B_DATE']]).agg('last').reset_index()
GRP B_DATE A_DATE VALUE
0 A 2014-02-19 2013-12-31 543.7
1 A 2014-04-30 2014-03-31 472.5
2 A 2014-07-30 2014-06-30 531.5
3 A 2014-10-30 2014-09-30 546.7
4 A 2015-02-18 2014-12-31 573.5
5 A 2015-04-30 2015-03-31 458.7
6 A 2015-07-30 2015-06-30 519.5
7 B 2015-07-07 2015-03-31 1361.0
8 B 2015-08-13 2015-06-30 1429.0
9 B 2015-10-29 2015-09-30 1595.0
10 B 2016-02-16 2015-12-31 1763.0
11 B 2016-04-28 2016-03-31 1548.0
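One note on this approach: agg('last') keeps whichever row happens to appear last within each group, so it relies on the frame already being sorted by A_DATE within each (GRP, B_DATE) pair, as the sample data is. An order-independent sketch using idxmax on A_DATE (shown on two hypothetical unsorted rows):

```python
import pandas as pd

df = pd.DataFrame({
    'GRP':    ['B', 'B'],
    'A_DATE': ['3/31/2015', '12/31/2014'],  # deliberately NOT sorted by A_DATE
    'B_DATE': ['7/7/2015', '7/7/2015'],
    'VALUE':  [1361.0, 1683.0],
})
df['A_DATE'] = pd.to_datetime(df['A_DATE'])
df['B_DATE'] = pd.to_datetime(df['B_DATE'])

# idxmax returns the index label of the maximum A_DATE in each group,
# regardless of the original row order.
ndf = df.loc[df.groupby(['GRP', 'B_DATE'])['A_DATE'].idxmax()]
```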
Upvotes: 1