Reputation: 333
Merge 2 dataframes.
I have dataframe A that has a value in a non constant timestamp:
creation_time value_A
2020-09-22 00:00:35 83.0
2020-09-22 00:00:43 83.0
2020-09-22 00:02:20 82.0
2020-09-22 00:03:09 77.0
2020-09-22 00:03:04 77.0
2020-09-22 00:03:44 77.0
2020-09-22 00:07:10 71.0
...
2020-10-23 11:42:31 136.0
2020-10-23 11:42:32 136.0
2020-10-23 11:42:35 136.0
2020-10-29 11:42:31 136.0
2020-10-29 11:42:32 136.0
2020-10-29 11:42:35 136.0
and dataframe B. The first column is a range. For example, from 0 to 75 on 2020-09-22 the value_B should be 60.
value_A_range creation_time value_B
0 2020-09-22 00:00:00.000000 60
75 2020-09-22 00:00:00.000000 65
124 2020-09-22 00:00:00.000000 300
143 2020-09-22 00:00:00.000000 360
0 2020-10-16 12:23:25.000000 60
75 2020-10-16 12:23:25.000000 400
124 2020-10-16 12:23:25.000000 400
143 2020-10-16 12:23:25.000000 450
0 2020-10-28 15:53:31.000000 10
82 2020-10-28 15:53:31.000000 30
114 2020-10-28 15:53:31.000000 40
129 2020-10-28 15:53:31.000000 60
139 2020-10-28 15:53:31.000000 110
Expected result: Get value_B for each timestamp in dataframe A. Match on range for value_A on the available timestamp.
creation_time value_A value_B
2020-09-22 00:00:35 83.0 65
2020-09-22 00:00:43 83.0 65
2020-09-22 00:02:20 82.0 65
2020-09-22 00:03:09 60.0 60
2020-09-22 00:03:04 60.0 60
2020-09-22 00:03:44 60.0 60
2020-09-22 00:07:10 129.0 300
...
2020-10-23 11:42:31 136.0 400
2020-10-23 11:42:32 156.0 450
2020-10-23 11:42:35 136.0 400
2020-10-29 11:42:31 85.0 30
2020-10-29 11:42:32 120.0 40
2020-10-29 11:42:35 160.0 110
Resources I'm trying: range and merge_asof
Upvotes: 1
Views: 301
Reputation: 3001
Here is an approach based on iterating over Dataframe B. Iteration is discouraged in the pandas documentation, but this enables flexible matching logic:
from io import StringIO
import pandas as pd
# create data frames
df_a = pd.read_csv(StringIO(data_a), sep='\s\s+',
engine='python', parse_dates=['creation_time'])
df_b = pd.read_csv(StringIO(data_b), sep='\s\s+',
engine='python', parse_dates=['creation_time'])
# create destination column in data frame A
df_a['value_B'] = None
# pull top of range from next row, and provide missing value (last row only)
df_b['ubound'] = df_b['value_A_range'].shift(-1).fillna(2 ** 30).astype(int)
# process each row of dataframe B
# one row of B might populate multiple rows of A
for row in df_b.itertuples():
m1 = row.value_A_range <= df_a['value_A']
m2 = df_a['value_A'] < row.ubound
m3 = df_a['creation_time'].dt.date == row.creation_time.date()
df_a.loc[m1 & m2 & m3, 'value_B'] = row.value_B
if not (m1 & m2 & m3).any():
print(f'no match: {row.Index:4d})
Upvotes: 1
Reputation: 518
dA = {'creation_time': ['2020-09-22 00:00:35',
'2020-09-22 00:00:43',
'2020-09-22 00:02:20',
'2020-09-22 00:03:09',
'2020-09-22 00:03:04'],
'value_A': [83.0, 83.0, 82.0, 77.0, 77.0]}
dfA =pd.DataFrame(data=dA)
dfA['creation_time'] = pd.to_datetime(dfA['creation_time'])
dfA.sort_values(by=['value_A'], inplace=True)
dB = {'value_A_range': [0, 75, 124, 143, 0],
'creation_time': ['2020-09-22 00:00:00.000000',
'2020-09-22 00:00:00.000000',
'2020-09-22 00:00:00.000000',
'2020-09-22 00:00:00.000000',
'2020-10-16 12:23:25.000000'],
'value_B': [60, 65, 300, 360, 60],}
dfB =pd.DataFrame(data=dB)
dfB['creation_time'] = pd.to_datetime(dfB['creation_time'])
dfB['value_A_range'] = dfB['value_A_range'].astype(float)
dfB.sort_values(by=['value_A_range'], inplace=True)
df = pd.merge_asof(dfA, dfB, left_on='value_A', right_on='value_A_range', suffixes=('','_old'))
df.drop(columns=['value_A_range','creation_time_old'], inplace=True)
df.sort_values(by=['creation_time'], inplace=True)
creation_time value_A value_B
3 2020-09-22 00:00:35 83.0 65
4 2020-09-22 00:00:43 83.0 65
2 2020-09-22 00:02:20 82.0 65
1 2020-09-22 00:03:04 77.0 65
0 2020-09-22 00:03:09 77.0 65
Upvotes: 1
Reputation: 1174
I put the two dataframe in two files A.csv and B.csv
import pandas as pd
df_a = pd.read_csv("A.csv", parse_dates=["creation_time"])
df_b = pd.read_csv("B.csv", parse_dates=["creation_time"])
After loading the data, I calculate the start and the end of the time interval in B
df_b["A_end"] = df_b["creation_time"] + pd.to_timedelta(df_b["value_A_range"], unit="sec")
df_b["A_start"] = df_b["A_end"].apply(lambda cell: df_b[df_b["A_end"]<cell]["A_end"].max())
Before searching the value_B out of B (there should be only one value, therefore it doesn't matter if you use min, max or mean)
df_a["value_B"] = df_a["creation_time"].apply(
lambda cell: df_b[(df_b["A_start"]<cell) & (df_b["A_end"]>=cell)]["value_B"].mean()
)
Upvotes: 1