Reputation: 1657
here's a sample dataset i've created for this question:
data1 = pd.DataFrame([['1','303','3/7/2016'],
['4','404','6/23/2011'],
['7','101','3/7/2016'],
['1','303','5/6/2017']],
columns=["code", "ticket #", "CB date"])
data1['CB date'] = pd.to_datetime(data1['CB date'])
data2 = pd.DataFrame([['1','303','2/5/2016'],
['4','404','6/23/2011'],
['7','101','3/17/2016'],
['1','303','4/6/2017']],
columns=["code", "ticket #", "audit date"])
data2['audit date'] = pd.to_datetime(data2['audit date'])
print(data1)
print(data2)
code ticket # CB date
0 1 303 2016-03-07
1 4 404 2011-06-23
2 7 101 2016-03-07
3 1 303 2017-05-06
code ticket # audit date
0 1 303 2016-02-05
1 4 404 2011-06-23
2 7 101 2016-03-17
3 1 303 2017-04-06
I want to merge the two df's, and make sure that the CB dates are always on or after Audit dates:
data_all = data1.merge(data2, how='inner', on=['code', 'ticket #'])
data_all = data_all[data_all['audit date'] <= data_all['CB date']]
print(data_all)
code ticket # CB date audit date
0 1 303 2016-03-07 2016-02-05
2 1 303 2017-05-06 2016-02-05
3 1 303 2017-05-06 2017-04-06
4 4 404 2011-06-23 2011-06-23
However, I only want to keep the rows with earliest CB date after each audit date. So in above output, row 2 shouldn't be there, because row 1 and row 2 both have same audit date 2016/2/5, but I only want to keep row 1 since the CB date is much closer to 2016/2/5 than row 2 CB date does.
Desired output:
code ticket # CB date audit date
0 1 303 2016-03-07 2016-02-05
3 1 303 2017-05-06 2017-04-06
4 4 404 2011-06-23 2011-06-23
I know in SQL I'd have to gorupby code & ticket # & Audit date first, then order CB date in ascending order, then take the item rank = 1 in each group; but how can I do this in Python/Pandas?
I read other posts here but I am still not getting it, so would really appreciate some advice here.
Few posts I read into include:
Upvotes: 3
Views: 7852
Reputation: 402493
I'd do this with an optional sort_values
call and a drop_duplicates
call.
data_all.sort_values(data_all.columns.tolist())\
.drop_duplicates(subset=['CB date'], keep='first')
code ticket # CB date audit date
0 1 303 2016-03-07 2016-02-05
2 1 303 2017-05-06 2016-02-05
4 4 404 2011-06-23 2011-06-23
I say the sort_values
call is optional here, since your data appears to be sorted already. If it isn't, make sure that's part of your solution.
Upvotes: 2