Allen
Allen

Reputation: 31

Combining 2 dataframes of different lengths by nearest observed datetime index

I am currently trying to find a way to merge specific rows of df2 to df1 based on their datetime indices in a way that avoids lookahead bias so that I can add external features (df2) to my main dataset (df1) for ML applications. The lengths of the dataframes are different, and the datetime indices aren't increasing at a constant rate. My current thought process is to do this by using nested loops and if statements, but this method would be too slow as the dataframes I am trying to do this on both have over 30000 rows each. Is there a faster way of doing this?

df1

index                 a   b
2015-06-02 16:00:00   0   5
2015-06-05 16:00:00   1   6
2015-06-06 16:00:00   2   7
2015-06-11 16:00:00   3   8
2015-06-12 16:00:00   4   9

df2

index                 c   d
2015-06-02 9:03:00    10   16
2015-06-02 15:12:00   11   17
2015-06-02 16:07:00   12   18
                ...   ...  ...
2015-06-12 15:29:00   13   19
2015-06-12 16:02:00   14   20
2015-06-12 17:33:00   15   21

df_combined (because you can't see the rows at 06-05, 06-06, 06-11, I just have NaN as the row values to make it easier to interpret)

index                 a   b   c   d
2015-06-02 16:00:00   0   5   11   17   
2015-06-05 16:00:00   1   NaN NaN NaN
2015-06-06 16:00:00   2   NaN NaN NaN
2015-06-11 16:00:00   3   NaN NaN NaN
2015-06-12 16:00:00   4   9   13   19

df_combined.loc[0, ['c', 'd']] and df_combined.loc[4, ['c', 'd']] are 11,17 and 13,19 respectively instead of 12,18 and 14,20 to avoid lookahead bias because in a live scenario, those values haven't been observed yet.

Upvotes: 1

Views: 309

Answers (1)

Ben.T
Ben.T

Reputation: 29635

IIUC, you need merge_asof. assuming your index are ordered in time, it is with the direction backward.

print(pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='backward'))
#                      a  b   c   d
# 2015-06-02 16:00:00  0  5  11  17
# 2015-06-05 16:00:00  1  6  12  18
# 2015-06-06 16:00:00  2  7  12  18
# 2015-06-11 16:00:00  3  8  12  18
# 2015-06-12 16:00:00  4  9  13  19

Note that the dates 06-05, 06-06, 06-11 are not NaN but it is the last values in df2 (for 2015-06-02 16:07:00) being available before these dates in your given data.

Note: if what your dates are actually a column named index and not your index, then do:

print(pd.merge_asof(df1, df2, on='index', direction='backward'))

Upvotes: 1

Related Questions