Python Pandas: how to take only the earliest date in each group

Question

here's a sample dataset i've created for this question:

data1 = pd.DataFrame([['1','303','3/7/2016'],
                     ['4','404','6/23/2011'],
                     ['7','101','3/7/2016'],
                     ['1','303','5/6/2017']],
                    columns=["code", "ticket #", "CB date"])
data1['CB date'] = pd.to_datetime(data1['CB date'])
data2 = pd.DataFrame([['1','303','2/5/2016'],
                     ['4','404','6/23/2011'],
                     ['7','101','3/17/2016'],
                     ['1','303','4/6/2017']],
                    columns=["code", "ticket #", "audit date"])
data2['audit date'] = pd.to_datetime(data2['audit date'])
print(data1)
print(data2)
  code ticket #    CB date
0    1      303 2016-03-07
1    4      404 2011-06-23
2    7      101 2016-03-07
3    1      303 2017-05-06
  code ticket # audit date
0    1      303 2016-02-05
1    4      404 2011-06-23
2    7      101 2016-03-17
3    1      303 2017-04-06

I want to merge the two df's, and make sure that the CB dates are always on or after Audit dates:

data_all = data1.merge(data2, how='inner', on=['code', 'ticket #'])
data_all = data_all[data_all['audit date'] <= data_all['CB date']]
print(data_all)
  code ticket #    CB date audit date
0    1      303 2016-03-07 2016-02-05
2    1      303 2017-05-06 2016-02-05
3    1      303 2017-05-06 2017-04-06
4    4      404 2011-06-23 2011-06-23

However, I only want to keep the rows with earliest CB date after each audit date. So in above output, row 2 shouldn't be there, because row 1 and row 2 both have same audit date 2016/2/5, but I only want to keep row 1 since the CB date is much closer to 2016/2/5 than row 2 CB date does.

Desired output:

code ticket #    CB date audit date
0    1      303 2016-03-07 2016-02-05
3    1      303 2017-05-06 2017-04-06
4    4      404 2011-06-23 2011-06-23

I know in SQL I'd have to gorupby code & ticket # & Audit date first, then order CB date in ascending order, then take the item rank = 1 in each group; but how can I do this in Python/Pandas?

I read other posts here but I am still not getting it, so would really appreciate some advice here.

Few posts I read into include:

cs95 · Accepted Answer

I'd do this with an optional sort_values call and a drop_duplicates call.

data_all.sort_values(data_all.columns.tolist())\
          .drop_duplicates(subset=['CB date'], keep='first')

  code ticket #    CB date audit date
0    1      303 2016-03-07 2016-02-05
2    1      303 2017-05-06 2016-02-05
4    4      404 2011-06-23 2011-06-23

I say the sort_values call is optional here, since your data appears to be sorted already. If it isn't, make sure that's part of your solution.

Python Pandas: how to take only the earliest date in each group

Answers (1)

Related Questions