awwlaz

Reputation: 41

How to merge two dataframes based on the closest (or most recent) timestamp

Suppose I have a dataframe df1, with columns 'A' and 'B'. A is a column of timestamps (e.g. unixtime) and 'B' is a column of some value.

Suppose I also have a dataframe df2 with columns 'C' and 'D'. C is also a unixtime column and D is a column containing some other values.

I would like to fuzzy merge the dataframes with a join on the timestamp. However, if the timestamps don't match (which they most likely don't), I would like it to merge on the closest entry before the timestamp in 'A' that it can find in 'C'.

pd.merge does not support this, so I find myself converting the dataframes with to_dict() and iterating over the result by hand. Is there a way to do this in pandas?

Upvotes: 4

Views: 3321

Answers (2)

philngo

Reputation: 931

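Building on @Stefan's answer and @JohnE's comment, something similar can be done with pandas.merge_asof for pandas>=0.19.0. By default it pairs each row of the left frame with the last row of the right frame whose key is less than or equal to the left key, which is the "closest entry before" behaviour asked for. Both frames must be sorted by their key columns:

>>> import numpy as np
>>> import pandas as pd
>>> from datetime import datetime, timedelta
>>> start = datetime(2015, 12, 1)
>>> a_timestamps = pd.date_range(start, start + timedelta(hours=4.5), freq='30Min')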
>>> c_timestamps = pd.date_range(start, start + timedelta(hours=9), freq='H')
>>> df1 = pd.DataFrame({'A': a_timestamps, 'B': range(10)})

                    A  B
0 2015-12-01 00:00:00  0
1 2015-12-01 00:30:00  1
2 2015-12-01 01:00:00  2
3 2015-12-01 01:30:00  3
4 2015-12-01 02:00:00  4
5 2015-12-01 02:30:00  5
6 2015-12-01 03:00:00  6
7 2015-12-01 03:30:00  7
8 2015-12-01 04:00:00  8
9 2015-12-01 04:30:00  9

>>> df2 = pd.DataFrame({'C': c_timestamps, 'D': range(10, 20)})

                   C   D
0 2015-12-01 00:00:00  10
1 2015-12-01 01:00:00  11
2 2015-12-01 02:00:00  12
3 2015-12-01 03:00:00  13
4 2015-12-01 04:00:00  14
5 2015-12-01 05:00:00  15
6 2015-12-01 06:00:00  16
7 2015-12-01 07:00:00  17
8 2015-12-01 08:00:00  18
9 2015-12-01 09:00:00  19

>>> pd.merge_asof(left=df1, right=df2, left_on='A', right_on='C')

                    A  B                   C   D
0 2015-12-01 00:00:00  0 2015-12-01 00:00:00  10
1 2015-12-01 00:30:00  1 2015-12-01 00:00:00  10
2 2015-12-01 01:00:00  2 2015-12-01 01:00:00  11
3 2015-12-01 01:30:00  3 2015-12-01 01:00:00  11
4 2015-12-01 02:00:00  4 2015-12-01 02:00:00  12
5 2015-12-01 02:30:00  5 2015-12-01 02:00:00  12
6 2015-12-01 03:00:00  6 2015-12-01 03:00:00  13
7 2015-12-01 03:30:00  7 2015-12-01 03:00:00  13
8 2015-12-01 04:00:00  8 2015-12-01 04:00:00  14
9 2015-12-01 04:30:00  9 2015-12-01 04:00:00  14
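
Since the question title asks about the closest as well as the most recent timestamp, here is a minimal sketch of two further merge_asof options, direction and tolerance, which I believe are available in later pandas releases than 0.19.0. The small frames and the 30-minute cutoff are made up for illustration and are not taken from the question.

```python
import pandas as pd

# Illustrative data only; both frames are already sorted by their key column.
df1 = pd.DataFrame({
    'A': pd.to_datetime(['2015-12-01 00:10', '2015-12-01 00:50', '2015-12-01 02:40']),
    'B': [0, 1, 2],
})
df2 = pd.DataFrame({
    'C': pd.to_datetime(['2015-12-01 00:00', '2015-12-01 01:00', '2015-12-01 02:00']),
    'D': [10, 11, 12],
})

# Default (direction='backward'): last row of df2 with C <= A.
backward = pd.merge_asof(df1, df2, left_on='A', right_on='C')

# direction='nearest': row of df2 whose C is closest to A, before or after.
nearest = pd.merge_asof(df1, df2, left_on='A', right_on='C', direction='nearest')

# tolerance: discard backward matches that are more than 30 minutes old.
limited = pd.merge_asof(df1, df2, left_on='A', right_on='C',
                        tolerance=pd.Timedelta('30min'))

print(backward)
print(nearest)
print(limited)
```

Rows that have no match within the tolerance get NaT/NaN in the 'C' and 'D' columns, so 'D' is upcast to float in that result.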

Upvotes: 0

Stefan

Reputation: 42885

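numpy.searchsorted() finds the appropriate index positions to merge on (see docs). I hope the code below gets you closer to what you're looking for. Note that each row of df1 is paired with every row of df2 whose 'C' falls after the previous 'A' and at or before its own 'A', so a row can appear several times or with no match at all: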

from datetime import datetime, timedelta
from random import randrange
import numpy as np
import pandas as pd
start = datetime(2015, 12, 1)
df1 = pd.DataFrame({'A': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'B': [1] * 10}).sort_values('A').reset_index(drop=True)
df2 = pd.DataFrame({'C': [start + timedelta(minutes=randrange(60)) for i in range(10)], 'D': [2] * 10}).sort_values('C').reset_index(drop=True)
# Relabel each df2 row with the position of the first df1.A that is >= its C
df2.index = np.searchsorted(df1.A.values, df2.C.values)
print(pd.merge(left=df1, right=df2, left_index=True, right_index=True, how='left'))

                    A  B                   C   D
0 2015-12-01 00:01:00  1                 NaT NaN
1 2015-12-01 00:02:00  1 2015-12-01 00:02:00   2
2 2015-12-01 00:02:00  1                 NaT NaN
3 2015-12-01 00:12:00  1 2015-12-01 00:05:00   2
4 2015-12-01 00:16:00  1 2015-12-01 00:14:00   2
4 2015-12-01 00:16:00  1 2015-12-01 00:14:00   2
5 2015-12-01 00:28:00  1 2015-12-01 00:22:00   2
6 2015-12-01 00:30:00  1                 NaT NaN
7 2015-12-01 00:39:00  1 2015-12-01 00:31:00   2
7 2015-12-01 00:39:00  1 2015-12-01 00:39:00   2
8 2015-12-01 00:55:00  1 2015-12-01 00:40:00   2
8 2015-12-01 00:55:00  1 2015-12-01 00:46:00   2
8 2015-12-01 00:55:00  1 2015-12-01 00:54:00   2
9 2015-12-01 00:57:00  1                 NaT NaN
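
If exactly one row per 'A' is wanted (the most recent 'C' at or before it), the lookup can be turned around so that 'A' is searched in 'C'. The following is only a sketch of that variation, not the code from the answer above; the sample values and the names pos, right and result are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data only; df2 must be sorted by 'C' for searchsorted to work.
df1 = pd.DataFrame({
    'A': pd.to_datetime(['2015-12-01 00:05', '2015-12-01 00:20',
                         '2015-12-01 00:25', '2015-12-01 00:45']),
    'B': [1, 1, 1, 1],
})
df2 = pd.DataFrame({
    'C': pd.to_datetime(['2015-12-01 00:10', '2015-12-01 00:20', '2015-12-01 00:30']),
    'D': [20, 21, 22],
})

# side='right' gives the number of C values <= A; subtracting 1 gives the
# position of the last such C, or -1 when no C is at or before A.
pos = np.searchsorted(df2['C'].to_numpy(), df1['A'].to_numpy(), side='right') - 1

# reindex returns an all-missing row for the label -1, which is not in df2's index.
right = df2.reindex(pos).reset_index(drop=True)

# Both frames now share the index 0..n-1, so they can be placed side by side.
result = pd.concat([df1, right], axis=1)
print(result)
```

This keeps one output row per row of df1, which is the same shape that merge_asof produces with its default settings.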

Upvotes: 3
