Reputation: 3299
The objective is to combine two DataFrames row-wise when a predetermined condition is met.
Specifically, if the absolute difference between the values in the time column is less than or equal to a threshold, the corresponding rows of the two DataFrames should be joined.
Given two DataFrames, df1 and df2, the following code partially achieves the objective.
import pandas as pd

df1 = pd.DataFrame({'time': [2, 3, 4, 24, 31]})
df2 = pd.DataFrame({'time': [4.1, 24.7, 31.4, 5]})
th = 0.9

# Compare every row of df1 with every row of df2 and keep the close pairs
all_comb = []
for index, row in df1.iterrows():
    for index2, row2 in df2.iterrows():
        diff = abs(row['time'] - row2['time'])
        if diff <= th:
            all_comb.append({'idx_1': index, 'time_1': row['time'],
                             'idx_2': index2, 'time_2': row2['time']})
df_all = pd.DataFrame(all_comb)
This outputs:
   idx_1  time_1  idx_2  time_2
0      2       4      0     4.1
1      3      24      1    24.7
2      4      31      2    31.4
However, this approach drops some information: the values 2 and 3 from df1 and the value 5 from df2, none of which has a match.
The expected output should be something like:
idx_1  time_1  idx_2  time_2
    0       2     NA      NA
    1       3     NA      NA
    2       4      0     4.1
    3      24      1    24.7
    4      31      2    31.4
   NA      NA      3       5
I would appreciate any hint, or any approach that is more compact and efficient than the one proposed above.
Upvotes: 4
Views: 3683
Reputation: 28649
conditional_join from pyjanitor handles inequality joins efficiently, without having to resort to a cartesian join, and should help with your use case:
# pip install pyjanitor
import pandas as pd
import janitor

df1 = (df1
       .astype({'time': float})
       .rename(columns={'time': 'time_1'})
       .assign(idx_1=df1.index)
       )
# Build the [time - th, time + th] window for every row of df2
df2 = (df2
       .assign(start=df2.time - th,
               end=df2.time + th,
               idx_2=df2.index)
       .rename(columns={'time': 'time_2'})
       )
# Keep pairs where start <= time_1 <= end
(df1
 .conditional_join(
     df2,
     ('time_1', 'start', '>='),
     ('time_1', 'end', '<='),
     how='outer',
     df_columns="*_1",
     right_columns="*_2")
 )
   time_1  idx_1  time_2  idx_2
0     2.0    0.0     NaN    NaN
1     3.0    1.0     NaN    NaN
2     4.0    2.0     4.1    0.0
3    24.0    3.0    24.7    1.0
4    31.0    4.0    31.4    2.0
5     NaN    NaN     5.0    3.0
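If installing pyjanitor is not an option, the general idea of avoiding the cartesian product can be sketched with plain NumPy and pandas. This is only an illustration and makes no claim about how conditional_join works internally: it sorts df2 once and uses a binary search to find, for each value in df1, the window of df2 values within the threshold. The function name range_join_outer is made up for this example, and it assumes both frames have a numeric time column. Values sitting exactly on the threshold boundary may be affected by floating-point rounding.

```python
import numpy as np
import pandas as pd


def range_join_outer(df1, df2, th):
    """Outer-join rows whose 'time' values differ by at most th (hypothetical helper)."""
    t1 = df1['time'].to_numpy()
    raw2 = df2['time'].to_numpy()
    # Sort df2 once so each df1 value can find its window by binary search
    order = np.argsort(raw2, kind='stable')
    t2 = raw2[order]

    lo = np.searchsorted(t2, t1 - th, side='left')    # first candidate in window
    hi = np.searchsorted(t2, t1 + th, side='right')   # one past the last candidate
    counts = hi - lo                                  # matches per df1 row

    # Expand the windows into explicit (df1 position, df2 position) pairs
    pos1 = np.repeat(np.arange(len(t1)), counts)
    offsets = np.arange(counts.sum()) - np.repeat(np.cumsum(counts) - counts, counts)
    pos2 = order[np.repeat(lo, counts) + offsets]

    idx1, idx2 = df1.index.to_numpy(), df2.index.to_numpy()
    matched = pd.DataFrame({'idx_1': idx1[pos1], 'time_1': t1[pos1],
                            'idx_2': idx2[pos2], 'time_2': raw2[pos2]})

    # Rows that never appeared in a matched pair
    only1 = np.setdiff1d(np.arange(len(t1)), pos1)
    only2 = np.setdiff1d(np.arange(len(raw2)), pos2)
    left = pd.DataFrame({'idx_1': idx1[only1], 'time_1': t1[only1]})
    right = pd.DataFrame({'idx_2': idx2[only2], 'time_2': raw2[only2]})

    return pd.concat([left, matched, right], ignore_index=True)


df1 = pd.DataFrame({'time': [2, 3, 4, 24, 31]})
df2 = pd.DataFrame({'time': [4.1, 24.7, 31.4, 5]})
res = range_join_outer(df1, df2, th=0.9)
print(res)
```

The work is roughly a sort of df2 plus one binary search per row of df1, plus the size of the output, so it only grows large when many rows actually match.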
Upvotes: 0
Reputation: 59519
You can perform a cross merge and then subset all the rows at once based on your condition. Then concat the matches with any rows from either DataFrame that had no match.
import pandas as pd

df1 = df1.reset_index().add_suffix('_1')
df2 = df2.reset_index().add_suffix('_2')
m = df1.merge(df2, how='cross')

# Subset to all matches: |time_diff| <= thresh
th = 0.9
m = m[(m['time_1'] - m['time_2']).abs().le(th)]

# Add back rows with no matches
res = pd.concat([df1[~df1.index_1.isin(m.index_1)],
                 m,
                 df2[~df2.index_2.isin(m.index_2)]], ignore_index=True)
print(res)
   index_1  time_1  index_2  time_2
0      0.0     2.0      NaN     NaN
1      1.0     3.0      NaN     NaN
2      2.0     4.0      0.0     4.1
3      3.0    24.0      1.0    24.7
4      4.0    31.0      2.0    31.4
5      NaN     NaN      3.0     5.0
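For reuse, the same steps can be wrapped in a small function that leaves the input frames untouched. This is a minimal sketch: the name threshold_outer_join and the col parameter are made up here, it assumes both frames have an unnamed default index (so reset_index yields a column called index), and how='cross' needs pandas 1.2 or later. Keep in mind that the cross merge materialises len(df1) * len(df2) rows before filtering, so it suits small to medium inputs.

```python
import pandas as pd


def threshold_outer_join(df1, df2, th, col='time'):
    """Cross merge, keep pairs with |diff| <= th, then add back unmatched rows."""
    left = df1.reset_index().add_suffix('_1')
    right = df2.reset_index().add_suffix('_2')

    # Every combination of rows, then filter on the threshold
    pairs = left.merge(right, how='cross')
    close = (pairs[f'{col}_1'] - pairs[f'{col}_2']).abs().le(th)
    m = pairs[close]

    # Unmatched rows from each side go before and after the matches
    return pd.concat([left[~left['index_1'].isin(m['index_1'])],
                      m,
                      right[~right['index_2'].isin(m['index_2'])]],
                     ignore_index=True)


df1 = pd.DataFrame({'time': [2, 3, 4, 24, 31]})
df2 = pd.DataFrame({'time': [4.1, 24.7, 31.4, 5]})
res = threshold_outer_join(df1, df2, th=0.9)
print(res)
```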
Upvotes: 5