Windstorm1981

Reputation: 2680

Python Pandas: Complex Subset from Dataframe

I have a dataframe with a group column (GRP), two date columns (A_DATE, B_DATE), and a value column (VALUE).

I want a subset of the dataframe that keeps one row per unique B_DATE value within each GRP. Where there are duplicate B_DATE values within a group, I want to keep the row with the maximum A_DATE value.

So, if my initial dataframe were:

GRP  A_DATE        B_DATE   VALUE
A   12/31/2012  2/19/2014   546.2
A   12/31/2013  2/19/2014   543.7
A   3/31/2013   4/30/2014   473.3
A   3/31/2014   4/30/2014   472.5
A   6/30/2013   7/30/2014   528.7
A   6/30/2014   7/30/2014   531.5
A   9/30/2013   10/30/2014  529
A   9/30/2014   10/30/2014  546.7
A   12/31/2014  2/18/2015   573.5
A   3/31/2015   4/30/2015   458.7
A   6/30/2015   7/30/2015   519.5
B   3/31/2014   7/7/2015    1329
B   12/31/2014  7/7/2015    1683
B   3/31/2015   7/7/2015    1361
B   6/30/2014   8/13/2015   1452
B   6/30/2015   8/13/2015   1429
B   9/30/2014   10/29/2015  1488
B   9/30/2015   10/29/2015  1595
B   12/31/2015  2/16/2016   1763
B   3/31/2016   4/28/2016   1548

I would want the result to look like this:

GRP  A_DATE        B_DATE   VALUE
A   12/31/2013  2/19/2014   543.7
A   3/31/2014   4/30/2014   472.5
A   6/30/2014   7/30/2014   531.5
A   9/30/2014   10/30/2014  546.7
A   12/31/2014  2/18/2015   573.5
A   3/31/2015   4/30/2015   458.7
A   6/30/2015   7/30/2015   519.5
B   3/31/2015   7/7/2015    1361
B   6/30/2015   8/13/2015   1429
B   9/30/2015   10/29/2015  1595
B   12/31/2015  2/16/2016   1763
B   3/31/2016   4/28/2016   1548

I know how to do this through cumbersome looping and argmax(). However, I'm wondering if there is a clean, efficient, Pythonic way to approach it.
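
For reference, this is roughly the cumbersome version I have now (a sketch: I parse the dates with pd.to_datetime first, then take the positional argmax() of A_DATE within each chunk):

import pandas as pd

# parse the date strings so they compare chronologically
df['A_DATE'] = pd.to_datetime(df['A_DATE'])
df['B_DATE'] = pd.to_datetime(df['B_DATE'])

keep = []
for (grp, b_date), chunk in df.groupby(['GRP', 'B_DATE']):
    pos = chunk['A_DATE'].values.argmax()  # position of the latest A_DATE in this chunk
    keep.append(chunk.index[pos])          # map back to the original index label
result = df.loc[sorted(keep)]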

Thanks in advance.

Upvotes: 0

Views: 51

Answers (2)

Scott Boston

Reputation: 153500

Let's use sort_values and drop_duplicates: sort descending on A_DATE within each GRP so the latest A_DATE comes first, then drop_duplicates on ['GRP', 'B_DATE'] keeps that first row (keep='first' is the default):

df.sort_values(['GRP','A_DATE'], ascending=[True,False])\
  .drop_duplicates(subset=['GRP','B_DATE'])

Output:

   GRP      A_DATE      B_DATE   VALUE
7    A   9/30/2014  10/30/2014   546.7
10   A   6/30/2015   7/30/2015   519.5
5    A   6/30/2014   7/30/2014   531.5
9    A   3/31/2015   4/30/2015   458.7
3    A   3/31/2014   4/30/2014   472.5
8    A  12/31/2014   2/18/2015   573.5
1    A  12/31/2013   2/19/2014   543.7
17   B   9/30/2015  10/29/2015  1595.0
15   B   6/30/2015   8/13/2015  1429.0
19   B   3/31/2016   4/28/2016  1548.0
13   B   3/31/2015    7/7/2015  1361.0
18   B  12/31/2015   2/16/2016  1763.0

And add sort_index to get back the original row order:

df.sort_values(['GRP','A_DATE'], ascending=[True,False])\
  .drop_duplicates(subset=['GRP','B_DATE']).sort_index()

   GRP      A_DATE      B_DATE   VALUE
1    A  12/31/2013   2/19/2014   543.7
3    A   3/31/2014   4/30/2014   472.5
5    A   6/30/2014   7/30/2014   531.5
7    A   9/30/2014  10/30/2014   546.7
8    A  12/31/2014   2/18/2015   573.5
9    A   3/31/2015   4/30/2015   458.7
10   A   6/30/2015   7/30/2015   519.5
13   B   3/31/2015    7/7/2015  1361.0
15   B   6/30/2015   8/13/2015  1429.0
17   B   9/30/2015  10/29/2015  1595.0
18   B  12/31/2015   2/16/2016  1763.0
19   B   3/31/2016   4/28/2016  1548.0
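
Since A_DATE is a plain string here, the descending sort compares the dates lexically; that works on this sample, but if it's a concern, parsing A_DATE to datetime first keeps the sort chronological (a sketch of the same chain):

df['A_DATE'] = pd.to_datetime(df['A_DATE'])

df.sort_values(['GRP','A_DATE'], ascending=[True,False])\
  .drop_duplicates(subset=['GRP','B_DATE']).sort_index()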

Upvotes: 3

Bharath M Shetty

Reputation: 30605

I think you want to groupby 'GRP' and B_DATE and aggregate the last value, i.e.

df['A_DATE'] = pd.to_datetime(df['A_DATE'])
df['B_DATE'] = pd.to_datetime(df['B_DATE'])

ndf = df.groupby(['GRP',df['B_DATE']]).agg('last').reset_index()

Output:

    GRP     B_DATE     A_DATE   VALUE
0    A 2014-02-19 2013-12-31   543.7
1    A 2014-04-30 2014-03-31   472.5
2    A 2014-07-30 2014-06-30   531.5
3    A 2014-10-30 2014-09-30   546.7
4    A 2015-02-18 2014-12-31   573.5
5    A 2015-04-30 2015-03-31   458.7
6    A 2015-07-30 2015-06-30   519.5
7    B 2015-07-07 2015-03-31  1361.0
8    B 2015-08-13 2015-06-30  1429.0
9    B 2015-10-29 2015-09-30  1595.0
10   B 2016-02-16 2015-12-31  1763.0
11   B 2016-04-28 2016-03-31  1548.0
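
'last' takes whichever row appears last within each ('GRP', B_DATE) group in the frame's current order, which matches the desired output here because the sample rows are already ordered by A_DATE. If that ordering isn't guaranteed, a sketch that sorts on A_DATE first should be safer:

ndf = df.sort_values('A_DATE').groupby(['GRP', 'B_DATE']).agg('last').reset_index()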

Upvotes: 1
