Reputation: 391
I have two data frames to merge. Their timestamp frequency is ~5min but slightly off. There are dropouts in either data frame.
I'd tried to merge/join/align the two but I have issues with each approach. I need to make sure the data is correct (so would rather drop any missing point from both data frames) and I would like to get as much data as possible (e.g. if they are off for < 5min, I would still like to merge those values).
Using
df['Time'] = pd.to_datetime(df['Time'], errors='coerce')
df['Time'] = df['Time'].dt.round('1min')
with 1min and 5min rounding interval did not help.
Both pd.merge
and pd.join
drop too many data points due to the misalignment. I'm not sure if np.isclose
is easy to adjust for time to float and back to time series.
Just for illustration, here is how data frames approximately look like (the first row is the value, the second is the time index):
part of df1
10 2018-08-01 00:59:00
11 2018-08-01 01:04:00
12 2018-08-01 01:09:00
13 2018-08-01 01:14:00
14 2018-08-01 01:19:00
15 2018-08-01 01:24:00
16 2018-08-01 01:29:00
17 2018-08-01 01:34:00
18 2018-08-01 01:39:00
19 2018-08-01 01:44:00
110 2018-08-01 01:49:00
111 2018-08-01 01:54:00
112 2018-08-01 02:04:00
part of df2
20 2018-08-01 01:01:00
21 2018-08-01 01:06:00
22 2018-08-01 01:11:00
23 2018-08-01 01:16:00
24 2018-08-01 01:26:00
25 2018-08-01 01:36:00
26 2018-08-01 01:46:00
27 2018-08-01 01:51:00
28 2018-08-01 01:56:00
29 2018-08-01 02:01:00
Expected output (the first row is for time reference illustration):
00 2018-08-01 01:01:00 20 10
05 2018-08-01 01:06:00 21 11
10 2018-08-01 01:11:00 22 12
15 2018-08-01 01:16:00 23 13
20 missing df2 - skip
25 2018-08-01 01:26:00 24 15
30 missing df2 - skip
35 2018-08-01 01:36:00 25 17
40 missing df2 - skip
45 2018-08-01 01:46:00 26 19
50 2018-08-01 01:51:00 27 110
55 2018-08-01 01:56:00 28 111
60 missing in df1 - skip
Code to create df1 and df2:
df1 = pd.DataFrame({'val' : ['10 ', '11 ', '12 ', '13 ', '14 ', '15 ', '16 ', '17 ', '18 ', '19 ', '110', '111', '112']}, index= ['2018-08-01 00:59:00', '2018-08-01 01:04:00', '2018-08-01 01:09:00', '2018-08-01 01:14:00', '2018-08-01 01:19:00', '2018-08-01 01:24:00', '2018-08-01 01:29:00', '2018-08-01 01:34:00', '2018-08-01 01:39:00', '2018-08-01 01:44:00', '2018-08-01 01:49:00', '2018-08-01 01:54:00', '2018-08-01 02:04:00'])
df2 = pd.DataFrame({'val' :['20', '21', '22', '23', '24', '25', '26', '27', '28', '29']}, index= ['2018-08-01 01:01:00', '2018-08-01 01:06:00', '2018-08-01 01:11:00', '2018-08-01 01:16:00', '2018-08-01 01:26:00', '2018-08-01 01:36:00', '2018-08-01 01:46:00', '2018-08-01 01:51:00', '2018-08-01 01:56:00', '2018-08-01 02:01:00'])
I've tried many different ways/methods/options but most of them drop too much data or set most values to NaN.
Upvotes: 2
Views: 1777
Reputation: 323366
Try with merge_asof
df1.index=pd.to_datetime(df1.index)
df2.index=pd.to_datetime(df2.index)
pd.merge_asof(df2.reset_index(),df1.reset_index(),on='index',direction = 'nearest',tolerance =pd.Timedelta('5 min'))
Out[73]:
index val_x val_y
0 2018-08-01 01:01:00 20 10
1 2018-08-01 01:06:00 21 11
2 2018-08-01 01:11:00 22 12
3 2018-08-01 01:16:00 23 13
4 2018-08-01 01:26:00 24 15
5 2018-08-01 01:36:00 25 17
6 2018-08-01 01:46:00 26 19
7 2018-08-01 01:51:00 27 110
8 2018-08-01 01:56:00 28 111
9 2018-08-01 02:01:00 29 112
Upvotes: 2