Reputation: 7404
I have data from several experiments, indexed by a subject ID and a date. I'd like to join the data together, but the subjects may undergo experiments on different days. Here is an example of what I mean. Shown below are the results from two different experiments:
SubjectID Date ScoreA
1 2016-09-20 10
1 2016-09-21 12
1 2016-12-01 11
SubjectID Date ScoreB
1 2016-09-20 1
1 2016-09-24 5
1 2016-11-28 3
1 2016-12-11 9
I would like to join each row to the row with the closest available date. Ideally, my desired output is:
SubjectID Date1 Date2 ScoreA ScoreB
1 2016-09-20 2016-09-20 10 1
1 2016-09-21 2016-09-24 12 5
1 2016-12-01 2016-11-28 11 3
Note "closest date" is closest in absolute value. How can I achieve something like this?
Upvotes: 3
Views: 1155
Reputation: 8269
I don't know if there is a way to do what you want with default pandas functionality, but it's straightforward to do it with a custom aggregation function:
def pick_closest(g):
    # idxmin returns the index label of the smallest absolute date gap, which is what .loc expects
    closest_date_loc = (g.Date1 - g.Date2).abs().idxmin()
    return g.loc[closest_date_loc, ['ScoreA', 'Date2', 'ScoreB']]
merged = df1.merge(df2, on='SubjectID', suffixes=['1', '2'])
df3 = merged.groupby(['SubjectID','Date1'], as_index=False).apply(pick_closest).reset_index()
df3
SubjectID Date1 ScoreA Date2 ScoreB
0 1 2016-09-20 10 2016-09-20 1
1 1 2016-09-21 12 2016-09-20 1
2 1 2016-12-01 11 2016-11-28 3
In this code snippet, the two frames are initially merged on SubjectID, generating all possible combinations of Date1 and Date2. Then the pick_closest function selects the row with the smallest date difference between Date1 and Date2 for each SubjectID/Date1 group.
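As an alternative to the merge-then-pick approach described above, pandas also provides pd.merge_asof, which can match each row to the nearest key directly. The following is only a sketch of that different technique, not the code from this answer. It assumes the sample frames from the question, with Date parsed as datetimes; the names left, right, and nearest are made up for illustration.

```python
import pandas as pd

# Sample data from the question, with dates parsed as datetimes (assumption).
df1 = pd.DataFrame({
    "SubjectID": [1, 1, 1],
    "Date": pd.to_datetime(["2016-09-20", "2016-09-21", "2016-12-01"]),
    "ScoreA": [10, 12, 11],
})
df2 = pd.DataFrame({
    "SubjectID": [1, 1, 1, 1],
    "Date": pd.to_datetime(["2016-09-20", "2016-09-24", "2016-11-28", "2016-12-11"]),
    "ScoreB": [1, 5, 3, 9],
})

# merge_asof needs both frames sorted by the "on" key.
left = df1.rename(columns={"Date": "Date1"}).sort_values("Date1")
# Keep a copy of the right-hand date as Date2 so it survives the merge.
right = (
    df2.assign(Date2=df2["Date"])
       .rename(columns={"Date": "Date1"})
       .sort_values("Date1")
)

# For each left row, take the right row of the same subject whose date is
# nearest in absolute terms.
nearest = pd.merge_asof(left, right, on="Date1", by="SubjectID", direction="nearest")
print(nearest[["SubjectID", "Date1", "Date2", "ScoreA", "ScoreB"]])
```

On this sample it picks the same matches as df3 above. Like the groupby version, it can assign the same df2 row to more than one df1 row, and it avoids building every Date1/Date2 combination first.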
Upvotes: 2