Merge dataframe on closest date

Question

I have data from several experiments, indexed by a subject ID and a date. I'd like to join the data together, but the subjects may undergo the experiments on different days. Here is an example of what I mean. Shown below are the results from two different experiments:

SubjectID  Date        ScoreA
1          2016-09-20      10
1          2016-09-21      12
1          2016-12-01      11

SubjectID  Date        ScoreB
1          2016-09-20      1
1          2016-09-24      5
1          2016-11-28      3
1          2016-12-11      9

I would like to join each row to the row with the closest available date. So ideally, my desired output is:

SubjectID   Date1         Date2        ScoreA ScoreB
1            2016-09-20    2016-09-20    10      1
1            2016-09-21    2016-09-24    12      5
1            2016-12-01    2016-11-28    11      3

Note "closest date" is closest in absolute value. How can I achieve something like this?

foglerit · Accepted Answer

I don't know if there is a way to do what you want with default pandas functionality, but it's straightforward to do it with a custom aggregation function:

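def pick_closest(g):
    # idxmin returns the index label of the smallest absolute gap, which .loc expects
    # (on current pandas, argmin returns an integer position instead)
    closest_date_loc = (g.Date1 - g.Date2).abs().idxmin()
    return g.loc[closest_date_loc, ['ScoreA','Date2','ScoreB']]

# Pair every Date1 with every Date2 of the same subject; Date must be a datetime column
merged = df1.merge(df2, on='SubjectID', suffixes=['1', '2'])
df3  = merged.groupby(['SubjectID','Date1'], as_index=False).apply(pick_closest).reset_index()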
df3

   SubjectID      Date1  ScoreA      Date2  ScoreB
0          1 2016-09-20      10 2016-09-20       1
1          1 2016-09-21      12 2016-09-20       1
2          1 2016-12-01      11 2016-11-28       3

In this code snippet, the two frames are initially merged on SubjectID, generating all possible combinations of Date1 and Date2. Then the pick_closest function selects the row with the smallest date difference between Date1 and Date2 for each SubjectID/Date1 group.
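
For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same idea. The two frames are rebuilt from the tables in the question. The gap column and the variable names (gap_seconds, by_gap, by_asof) are my own choices for illustration, not part of the original answer. The first half does the merge-then-pick-smallest-gap step without a custom function. The second half uses pd.merge_asof with direction='nearest', which I believe is available in reasonably recent pandas versions and avoids building all Date1/Date2 combinations. It assumes both frames are sorted by their date column.

```python
import pandas as pd

# Example frames reconstructed from the tables in the question.
df1 = pd.DataFrame({
    "SubjectID": [1, 1, 1],
    "Date": pd.to_datetime(["2016-09-20", "2016-09-21", "2016-12-01"]),
    "ScoreA": [10, 12, 11],
})
df2 = pd.DataFrame({
    "SubjectID": [1, 1, 1, 1],
    "Date": pd.to_datetime(
        ["2016-09-20", "2016-09-24", "2016-11-28", "2016-12-11"]
    ),
    "ScoreB": [1, 5, 3, 9],
})

# Variant 1: build every Date1/Date2 pair per subject, then keep the
# pair with the smallest absolute gap for each (SubjectID, Date1).
merged = df1.merge(df2, on="SubjectID", suffixes=("1", "2"))
merged["gap_seconds"] = (
    (merged["Date1"] - merged["Date2"]).abs().dt.total_seconds()
)
closest_rows = merged.groupby(["SubjectID", "Date1"])["gap_seconds"].idxmin()
by_gap = merged.loc[
    closest_rows, ["SubjectID", "Date1", "ScoreA", "Date2", "ScoreB"]
].reset_index(drop=True)

# Variant 2: let merge_asof find the nearest date directly.
# Both inputs must be sorted by their date key.
left = df1.rename(columns={"Date": "Date1"}).sort_values("Date1")
right = df2.rename(columns={"Date": "Date2"}).sort_values("Date2")
by_asof = pd.merge_asof(
    left,
    right,
    left_on="Date1",
    right_on="Date2",
    by="SubjectID",         # only match rows of the same subject
    direction="nearest",    # closest date in either direction
)

print(by_gap)
print(by_asof)
```

The first variant is fine for small data, but the intermediate frame grows with the product of the number of dates per subject. The merge_asof variant should scale better. How ties between two equally distant dates are broken may differ between the two, so check that case if it matters for your data.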
