Reputation: 988
I am looking for an efficient way to merge two pandas data frames based on a function that takes as input columns from both data frames and returns True or False. E.g. Assume I have the following "tables":
import pandas as pd
df_1 = pd.DataFrame(data=[1, 2, 3])
df_2 = pd.DataFrame(data=[4, 5, 6])
def validation(a, b):
return ((a + b) % 2) == 0
I would like to join df1 and df2 on each row where the sum of the first column is an even number. The resulting table would be
1 5
df_3 = 2 4
2 6
3 5
Please think of it as a general problem not as a task to return just df_3. The solution should accept any function that validates a combination of columns and return True or False.
THX Lazloo
Upvotes: 3
Views: 1713
Reputation: 150745
You can do with merge
on parity:
(df_1.assign(parity=df_1[0]%2)
.merge(df_2.assign(parity=df_2[0]%2), on='dummy')
.drop('parity', axis=1)
)
output:
0_x 0_y
0 1 5
1 3 5
2 2 4
3 2 6
Upvotes: 2
Reputation: 59549
You can use broadcasting, or the outer
functions, to compare all rows. You'll run into issues as the length becomes large.
import pandas as pd
import numpy as np
def validation(a, b):
"""a,b : np.array"""
arr = np.add.outer(a, b) # How to combine rows
i,j = np.where(arr % 2 == 0) # Condition
return pd.DataFrame(np.stack([a[i], b[j]], axis=1))
validation(df_1[0].to_numpy(), df_2[0].to_numpy())
0 1
0 1 5
1 2 4
2 2 6
3 3 5
In this particular case you might leverage the fact that even numbers maintain parity when added to even numbers, and odd numbers change parity when added to odd numbers, so define that column and merge on that.
df_1['parity'] = df_1[0]%2
df_2['parity'] = df_2[0]%2
df_3 = df_1.merge(df_2, on='parity')
0_x parity 0_y
0 1 1 5
1 3 1 5
2 2 0 4
3 2 0 6
Upvotes: 1
Reputation: 2417
This is a basic solution but not very efficient if you are working on large dataframes
df_1.index *= 0
df_2.index *= 0
df = df_1.join(df_2, lsuffix='_2')
df = df[df.sum(axis=1) % 2 == 0]
Edit, here is a better solution
df_1.index = df_1.iloc[:,0] % 2
df_2.index = df_2.iloc[:,0] % 2
df = df_1.join(df_2, lsuffix='_2')
Upvotes: 0