dank
dank

Reputation: 333

Merge dataframe on datetime and range

Merge 2 dataframes.

I have dataframe A that has a value in a non constant timestamp:

creation_time   value_A
2020-09-22 00:00:35 83.0
2020-09-22 00:00:43 83.0
2020-09-22 00:02:20 82.0
2020-09-22 00:03:09 77.0
2020-09-22 00:03:04 77.0
2020-09-22 00:03:44 77.0
2020-09-22 00:07:10 71.0
...
2020-10-23 11:42:31 136.0
2020-10-23 11:42:32 136.0
2020-10-23 11:42:35 136.0
2020-10-29 11:42:31 136.0
2020-10-29 11:42:32 136.0
2020-10-29 11:42:35 136.0

and dataframe B. The first column is a range. For example, from 0 to 75 on 2020-09-22 the value_B should be 60.

value_A_range  creation_time value_B
0   2020-09-22 00:00:00.000000  60
75  2020-09-22 00:00:00.000000  65
124 2020-09-22 00:00:00.000000  300
143 2020-09-22 00:00:00.000000  360
0   2020-10-16 12:23:25.000000  60
75  2020-10-16 12:23:25.000000  400
124 2020-10-16 12:23:25.000000  400
143 2020-10-16 12:23:25.000000  450
0   2020-10-28 15:53:31.000000  10
82  2020-10-28 15:53:31.000000  30
114 2020-10-28 15:53:31.000000  40
129 2020-10-28 15:53:31.000000  60
139 2020-10-28 15:53:31.000000  110

Expected result: Get value_B for each timestamp in dataframe A. Match on range for value_A on the available timestamp.

creation_time   value_A  value_B
2020-09-22 00:00:35 83.0 65   
2020-09-22 00:00:43 83.0 65
2020-09-22 00:02:20 82.0 65
2020-09-22 00:03:09 60.0 60
2020-09-22 00:03:04 60.0 60
2020-09-22 00:03:44 60.0 60
2020-09-22 00:07:10 129.0 300
...
2020-10-23 11:42:31 136.0 400
2020-10-23 11:42:32 156.0 450
2020-10-23 11:42:35 136.0 400
2020-10-29 11:42:31 85.0 30
2020-10-29 11:42:32 120.0 40
2020-10-29 11:42:35 160.0 110

Resources I'm trying: range and merge_asof

Upvotes: 1

Views: 301

Answers (3)

jsmart
jsmart

Reputation: 3001

Here is an approach based on iterating over Dataframe B. Iteration is discouraged in the pandas documentation, but this enables flexible matching logic:

from io import StringIO
import pandas as pd

# create data frames
df_a = pd.read_csv(StringIO(data_a), sep='\s\s+', 
                   engine='python', parse_dates=['creation_time'])
df_b = pd.read_csv(StringIO(data_b), sep='\s\s+', 
                   engine='python', parse_dates=['creation_time'])

# create destination column in data frame A
df_a['value_B'] = None

# pull top of range from next row, and provide missing value (last row only)
df_b['ubound'] = df_b['value_A_range'].shift(-1).fillna(2 ** 30).astype(int)

# process each row of dataframe B
# one row of B might populate multiple rows of A

for row in df_b.itertuples():
    m1 = row.value_A_range <= df_a['value_A']
    m2 = df_a['value_A'] < row.ubound
    m3 = df_a['creation_time'].dt.date == row.creation_time.date()
    df_a.loc[m1 & m2 & m3, 'value_B'] = row.value_B
    
    if not (m1 & m2 & m3).any():
        print(f'no match: {row.Index:4d})

Upvotes: 1

winderland
winderland

Reputation: 518

Create example data. Make sure the merge columns are sorted.

dA = {'creation_time': ['2020-09-22 00:00:35', 
                       '2020-09-22 00:00:43', 
                       '2020-09-22 00:02:20',
                       '2020-09-22 00:03:09',
                       '2020-09-22 00:03:04'],
     'value_A': [83.0, 83.0, 82.0, 77.0, 77.0]}
dfA =pd.DataFrame(data=dA)
dfA['creation_time'] = pd.to_datetime(dfA['creation_time'])
dfA.sort_values(by=['value_A'], inplace=True)

dB = {'value_A_range': [0, 75, 124, 143, 0],
     'creation_time': ['2020-09-22 00:00:00.000000', 
                       '2020-09-22 00:00:00.000000', 
                       '2020-09-22 00:00:00.000000', 
                       '2020-09-22 00:00:00.000000', 
                       '2020-10-16 12:23:25.000000'],
     'value_B': [60, 65, 300, 360, 60],}
dfB =pd.DataFrame(data=dB)
dfB['creation_time'] = pd.to_datetime(dfB['creation_time'])
dfB['value_A_range'] = dfB['value_A_range'].astype(float)
dfB.sort_values(by=['value_A_range'], inplace=True)

Merge

df = pd.merge_asof(dfA, dfB, left_on='value_A', right_on='value_A_range', suffixes=('','_old'))
df.drop(columns=['value_A_range','creation_time_old'], inplace=True)
df.sort_values(by=['creation_time'], inplace=True)

Output

        creation_time  value_A  value_B
3 2020-09-22 00:00:35     83.0       65
4 2020-09-22 00:00:43     83.0       65
2 2020-09-22 00:02:20     82.0       65
1 2020-09-22 00:03:04     77.0       65
0 2020-09-22 00:03:09     77.0       65

Upvotes: 1

Fran&#231;ois B.
Fran&#231;ois B.

Reputation: 1174

I put the two dataframe in two files A.csv and B.csv

import pandas as pd

df_a = pd.read_csv("A.csv", parse_dates=["creation_time"])
df_b = pd.read_csv("B.csv", parse_dates=["creation_time"])

After loading the data, I calculate the start and the end of the time interval in B

df_b["A_end"] = df_b["creation_time"] + pd.to_timedelta(df_b["value_A_range"], unit="sec")
df_b["A_start"] = df_b["A_end"].apply(lambda cell: df_b[df_b["A_end"]<cell]["A_end"].max())

Before searching the value_B out of B (there should be only one value, therefore it doesn't matter if you use min, max or mean)

df_a["value_B"] = df_a["creation_time"].apply(
    lambda cell: df_b[(df_b["A_start"]<cell) & (df_b["A_end"]>=cell)]["value_B"].mean()
)

Upvotes: 1

Related Questions