Lazloo Xp
Lazloo Xp

Reputation: 988

Merge Pandas Dataframe based on boolean function

I am looking for an efficient way to merge two pandas data frames based on a function that takes as input columns from both data frames and returns True or False. E.g. Assume I have the following "tables":

import pandas as pd

df_1 = pd.DataFrame(data=[1, 2, 3])
df_2 = pd.DataFrame(data=[4, 5, 6])


def validation(a, b):
    return ((a + b) % 2) == 0

I would like to join df1 and df2 on each row where the sum of the first column is an even number. The resulting table would be

       1 5
df_3 = 2 4
       2 6
       3 5

Please think of it as a general problem not as a task to return just df_3. The solution should accept any function that validates a combination of columns and return True or False.

THX Lazloo

Upvotes: 3

Views: 1713

Answers (3)

Quang Hoang
Quang Hoang

Reputation: 150745

You can do with merge on parity:

(df_1.assign(parity=df_1[0]%2)
     .merge(df_2.assign(parity=df_2[0]%2), on='dummy')
     .drop('parity', axis=1)
)

output:

   0_x  0_y
0    1    5
1    3    5
2    2    4
3    2    6

Upvotes: 2

ALollz
ALollz

Reputation: 59549

You can use broadcasting, or the outer functions, to compare all rows. You'll run into issues as the length becomes large.

import pandas as pd
import numpy as np

def validation(a, b):
    """a,b : np.array"""
    arr = np.add.outer(a, b)     # How to combine rows
    i,j = np.where(arr % 2 == 0) # Condition

    return pd.DataFrame(np.stack([a[i], b[j]], axis=1))

validation(df_1[0].to_numpy(), df_2[0].to_numpy())

   0  1
0  1  5
1  2  4
2  2  6
3  3  5

In this particular case you might leverage the fact that even numbers maintain parity when added to even numbers, and odd numbers change parity when added to odd numbers, so define that column and merge on that.

df_1['parity'] = df_1[0]%2
df_2['parity'] = df_2[0]%2

df_3 = df_1.merge(df_2, on='parity')

   0_x  parity  0_y
0    1       1    5
1    3       1    5
2    2       0    4
3    2       0    6

Upvotes: 1

Ayoub ZAROU
Ayoub ZAROU

Reputation: 2417

This is a basic solution but not very efficient if you are working on large dataframes

df_1.index *= 0
df_2.index *= 0
df = df_1.join(df_2, lsuffix='_2')
df = df[df.sum(axis=1) % 2 == 0]

Edit, here is a better solution

df_1.index = df_1.iloc[:,0] % 2
df_2.index = df_2.iloc[:,0] % 2
df = df_1.join(df_2, lsuffix='_2')

Upvotes: 0

Related Questions