rpb
rpb

Reputation: 3299

Merge two pandas dataframe based on conditional

The objective is to combine two df row wise, if a predetermine condition is met. Specifically, if the difference between the column is less or equal than a threshold, then, join the row of the df.

Given two df: df1 and df2, the following code partially achieve the objective.

import pandas as pd

df1 = pd.DataFrame ( {'time': [2, 3, 4, 24, 31]} )
df2 = pd.DataFrame (  {'time': [4.1, 24.7, 31.4, 5]} )
th = 0.9
all_comb=[]
for index, row in df1.iterrows ():
    for index2, row2 in df2.iterrows ():
        diff = abs ( row ['time'] - row2 ['time'] )
        if diff <= th:
            all_comb.append({'idx_1':index,'time_1':row ['time'], 'idx_2':index2,'time_2':row2 ['time']})
df_all = pd.DataFrame(all_comb)

outputted

       idx_1  time_1  idx_2  time_2
0      2       4      0     4.1
1      3      24      1    24.7
2      4      31      2    31.4

However, the above approach ignore certain information i.e., the value of 2 and 3 from the df1, and the value of 5 from df2.

The expected output should be something like

idx_1  time_1  idx_2  time_2

0      2       NA    NA
1      3       NA    NA    
2       4      0     4.1
3      24      1    24.7
4      31      2    31.4
NA     NA      3     5

Appreciate for any hint or any way that more compact and efficient than the proposed above.

Upvotes: 4

Views: 3683

Answers (2)

sammywemmy
sammywemmy

Reputation: 28649

conditonal_join handles inequality joins efficiently, without having to resort to cartesian joins, and should help with your use case:

# pip install pyjanitor
import pandas as pd
import janitor

df1 = (df1
      .astype({'time':float})
      .rename(columns={'time':'time_1'})
      .assign(idx_1=df1.index)
      )

df2 = (df2
      .assign(start = df2.time - th, 
              end = df2.time + th,
              idx_2=df2.index)
      .rename(columns={'time':'time_2'})
     )

(df1
.conditional_join(
    df2, 
    ('time_1','start','>='), 
    ('time_1','end','<='),
    how='outer', 
    df_columns="*_1", 
    right_columns="*_2")
) 
   time_1  idx_1  time_2  idx_2
0     2.0    0.0     NaN    NaN
1     3.0    1.0     NaN    NaN
2     4.0    2.0     4.1    0.0
3    24.0    3.0    24.7    1.0
4    31.0    4.0    31.4    2.0
5     NaN    NaN     5.0    3.0

Upvotes: 0

ALollz
ALollz

Reputation: 59519

You can perform a cross merge and then subset all the rows at once based on your condition. Then we concat, adding back any rows that had no conditions met from both DataFrames.

import pandas as pd

df1 = df1.reset_index().add_suffix('_1')
df2 = df2.reset_index().add_suffix('_2')

m = df1.merge(df2, how='cross')

# Subset to all matches: |time_diff| <= thresh
th = 0.9
m = m[(m['time_1'] - m['time_2']).abs().le(th)]

# Add back rows with no matches
res = pd.concat([df1[~df1.index_1.isin(m.index_1)],
                 m,
                 df2[~df2.index_2.isin(m.index_2)]], ignore_index=True)

print(res)
   index_1  time_1  index_2  time_2
0      0.0     2.0      NaN     NaN
1      1.0     3.0      NaN     NaN
2      2.0     4.0      0.0     4.1
3      3.0    24.0      1.0    24.7
4      4.0    31.0      2.0    31.4
5      NaN     NaN      3.0     5.0

Upvotes: 5

Related Questions