Reputation: 31
I am currently trying to find a way to merge specific rows of df2
to df1
based on their datetime indices in a way that avoids lookahead bias so that I can add external features (df2
) to my main dataset (df1
) for ML applications. The lengths of the dataframes are different, and the datetime indices aren't increasing at a constant rate. My current thought process is to do this by using nested loops and if statements, but this method would be too slow as the dataframes I am trying to do this on both have over 30000 rows each. Is there a faster way of doing this?
df1
index a b
2015-06-02 16:00:00 0 5
2015-06-05 16:00:00 1 6
2015-06-06 16:00:00 2 7
2015-06-11 16:00:00 3 8
2015-06-12 16:00:00 4 9
df2
index c d
2015-06-02 9:03:00 10 16
2015-06-02 15:12:00 11 17
2015-06-02 16:07:00 12 18
... ... ...
2015-06-12 15:29:00 13 19
2015-06-12 16:02:00 14 20
2015-06-12 17:33:00 15 21
df_combined
(because you can't see the rows at 06-05
, 06-06
, 06-11
, I just have NaN
as the row values to make it easier to interpret)
index a b c d
2015-06-02 16:00:00 0 5 11 17
2015-06-05 16:00:00 1 NaN NaN NaN
2015-06-06 16:00:00 2 NaN NaN NaN
2015-06-11 16:00:00 3 NaN NaN NaN
2015-06-12 16:00:00 4 9 13 19
df_combined.loc[0, ['c', 'd']]
and df_combined.loc[4, ['c', 'd']]
are 11,17
and 13,19
respectively instead of 12,18
and 14,20
to avoid lookahead bias because in a live scenario, those values haven't been observed yet.
Upvotes: 1
Views: 309
Reputation: 29635
IIUC, you need merge_asof
. assuming your index are ordered in time, it is with the direction backward.
print(pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='backward'))
# a b c d
# 2015-06-02 16:00:00 0 5 11 17
# 2015-06-05 16:00:00 1 6 12 18
# 2015-06-06 16:00:00 2 7 12 18
# 2015-06-11 16:00:00 3 8 12 18
# 2015-06-12 16:00:00 4 9 13 19
Note that the dates 06-05
, 06-06
, 06-11
are not NaN but it is the last values in df2 (for 2015-06-02 16:07:00
) being available before these dates in your given data.
Note: if what your dates are actually a column named index and not your index, then do:
print(pd.merge_asof(df1, df2, on='index', direction='backward'))
Upvotes: 1