hebeha
hebeha

Reputation: 391

Merge pandas time series with time tolerance

I have two data frames to merge. Their timestamp frequency is ~5min but slightly off. There are dropouts in either data frame.

I'd tried to merge/join/align the two but I have issues with each approach. I need to make sure the data is correct (so would rather drop any missing point from both data frames) and I would like to get as much data as possible (e.g. if they are off for < 5min, I would still like to merge those values).

Using

df['Time'] = pd.to_datetime(df['Time'], errors='coerce')
df['Time'] = df['Time'].dt.round('1min')

with 1min and 5min rounding interval did not help.

Both pd.merge and pd.join drop too many data points due to the misalignment. I'm not sure if np.isclose is easy to adjust for time to float and back to time series.

Just for illustration, here is how data frames approximately look like (the first row is the value, the second is the time index):

part of df1

10  2018-08-01 00:59:00
11  2018-08-01 01:04:00
12  2018-08-01 01:09:00
13  2018-08-01 01:14:00
14  2018-08-01 01:19:00
15  2018-08-01 01:24:00
16  2018-08-01 01:29:00
17  2018-08-01 01:34:00
18  2018-08-01 01:39:00
19  2018-08-01 01:44:00
110  2018-08-01 01:49:00
111  2018-08-01 01:54:00
112  2018-08-01 02:04:00

part of df2

20  2018-08-01 01:01:00
21  2018-08-01 01:06:00
22  2018-08-01 01:11:00
23  2018-08-01 01:16:00
24  2018-08-01 01:26:00
25  2018-08-01 01:36:00
26  2018-08-01 01:46:00
27  2018-08-01 01:51:00
28  2018-08-01 01:56:00
29  2018-08-01 02:01:00

Expected output (the first row is for time reference illustration):

00  2018-08-01 01:01:00 20 10
05  2018-08-01 01:06:00 21 11
10  2018-08-01 01:11:00 22 12
15  2018-08-01 01:16:00 23 13
20  missing df2 - skip
25  2018-08-01 01:26:00 24 15
30  missing df2 - skip
35  2018-08-01 01:36:00 25 17
40  missing df2 - skip
45  2018-08-01 01:46:00 26 19
50  2018-08-01 01:51:00 27 110
55  2018-08-01 01:56:00 28 111
60  missing in df1 - skip

Code to create df1 and df2:

df1 = pd.DataFrame({'val' : ['10 ', '11 ', '12 ', '13 ', '14 ', '15 ', '16 ', '17 ', '18 ', '19 ', '110', '111', '112']}, index= ['2018-08-01 00:59:00', '2018-08-01 01:04:00', '2018-08-01 01:09:00', '2018-08-01 01:14:00', '2018-08-01 01:19:00', '2018-08-01 01:24:00', '2018-08-01 01:29:00', '2018-08-01 01:34:00', '2018-08-01 01:39:00', '2018-08-01 01:44:00', '2018-08-01 01:49:00', '2018-08-01 01:54:00', '2018-08-01 02:04:00'])
df2 = pd.DataFrame({'val' :['20', '21', '22', '23', '24', '25', '26', '27', '28', '29']}, index= ['2018-08-01 01:01:00', '2018-08-01 01:06:00', '2018-08-01 01:11:00', '2018-08-01 01:16:00', '2018-08-01 01:26:00', '2018-08-01 01:36:00', '2018-08-01 01:46:00', '2018-08-01 01:51:00', '2018-08-01 01:56:00', '2018-08-01 02:01:00'])

I've tried many different ways/methods/options but most of them drop too much data or set most values to NaN.

Upvotes: 2

Views: 1777

Answers (1)

BENY
BENY

Reputation: 323366

Try with merge_asof

df1.index=pd.to_datetime(df1.index)
df2.index=pd.to_datetime(df2.index)
pd.merge_asof(df2.reset_index(),df1.reset_index(),on='index',direction = 'nearest',tolerance =pd.Timedelta('5 min'))
Out[73]: 
                index val_x val_y
0 2018-08-01 01:01:00    20   10 
1 2018-08-01 01:06:00    21   11 
2 2018-08-01 01:11:00    22   12 
3 2018-08-01 01:16:00    23   13 
4 2018-08-01 01:26:00    24   15 
5 2018-08-01 01:36:00    25   17 
6 2018-08-01 01:46:00    26   19 
7 2018-08-01 01:51:00    27   110
8 2018-08-01 01:56:00    28   111
9 2018-08-01 02:01:00    29   112

Upvotes: 2

Related Questions