Reputation: 3288
I'm looking to merge two dataframes across multiple columns but with some additional conditions.
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', None, 'Z', 'V'],
    'optional_col3': [None, 'def', 'ghi', 'jkl']
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', 'Y', 'Z', 'W'],
    'optional_col3': ['abc', 'def', 'ghi', 'mno']
})
I would like to always join on col1 but then also try to join on optional_col2 and optional_col3. In df1, the value can be NaN for both columns, but it is always populated in df2. I would like the join to be valid when col1 plus at least one of optional_col2 or optional_col3 match.
This would result in ['a', 'b', 'c'] joining due to a match on optional_col2, a match on optional_col3, and a match on both, respectively.
In SQL I suppose you could write the join as this, if it helps explain further:
select
    *
from
    df1
inner join
    df2
    on df1.col1 = df2.col1
    AND (df1.optional_col2 = df2.optional_col2 OR df1.optional_col3 = df2.optional_col3)
I've messed around with pd.merge but can't figure out how to do a complex operation like this. I think I can do a merge on ['col1', 'optional_col2'], then a second merge on ['col1', 'optional_col3'], then union and drop duplicates?
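A minimal sketch of that two-merge idea, assuming the df1 and df2 defined above and assuming the output should carry df2's values (since df2 is always populated). The names cols, on_col2 and on_col3 are illustrative only:

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', None, 'Z', 'V'],
    'optional_col3': [None, 'def', 'ghi', 'jkl'],
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', 'Y', 'Z', 'W'],
    'optional_col3': ['abc', 'def', 'ghi', 'mno'],
})

cols = ['col1', 'optional_col2', 'optional_col3']

# Keep only the join keys from df1 so the remaining column comes from df2
# without an _x/_y suffix.
on_col2 = df1[['col1', 'optional_col2']].merge(df2, on=['col1', 'optional_col2'])
on_col3 = df1[['col1', 'optional_col3']].merge(df2, on=['col1', 'optional_col3'])

# Union the two results, then drop rows that matched on both optional columns.
merged_df = (
    pd.concat([on_col2[cols], on_col3[cols]])
    .drop_duplicates()
    .sort_values('col1')
    .reset_index(drop=True)
)
print(merged_df)
```

One caveat with this sketch: drop_duplicates also collapses rows that are genuinely repeated in the inputs, so it only suits data where each (col1, optional_col2, optional_col3) combination should appear once.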
Expected DataFrame would be something like:
merged_df = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'optional_col2': ['X', 'Y', 'Z'],
    'optional_col3': ['abc', 'def', 'ghi']
})
Upvotes: 1
Views: 201
Reputation: 2261
This solution works by creating an extra column called "temp" in both dataframes. In df1 it will be a column of True values. In df2 the values will be True if there is a match on either of the optional columns. I'm not clear whether you consider a NaN value to be matchable or not. If so, you need to fill the NaNs in df1's columns with the values from df2 before comparing, to meet your criteria around missing values (this is what is done below). If this is not required, drop the fillna calls in the example below.
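# Every row of df1 is a candidate for the merge.
df1["temp"] = True
# Rows are compared by index position, so df1 and df2 must be in the same order.
optional_col2_match = df1["optional_col2"].fillna(df2["optional_col2"]).eq(df2["optional_col2"])
optional_col3_match = df1["optional_col3"].fillna(df2["optional_col3"]).eq(df2["optional_col3"])
# A df2 row is kept when either optional column matches.
df2["temp"] = optional_col2_match | optional_col3_match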
Then use the "temp" column in the merge and drop it afterwards, since it has served its purpose:
pd.merge(df1, df2, on=["col1", "temp"]).drop(columns="temp")
This gives the following result:
  col1 optional_col2_x optional_col3_x optional_col2_y optional_col3_y
0    a               X             abc               X             abc
1    b               Y             def               Y             def
2    c               Z             ghi               Z             ghi
You will need to decide what to do here. In the example you gave there are no rows which match on just one of optional_col2 and optional_col3, which is why a three-column solution looks reasonable. This won't generally be the case.
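If the two frames are not guaranteed to be in the same row order, a variant that lines rows up by col1 instead of by index position may be safer. The sketch below mirrors the SQL in the question: merge on col1 only, then keep rows where either optional column matches. Collapsing to three columns by taking df2's values is an assumption made here (df2 is always populated), not something the question specifies, and the names merged, matched and result are illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', None, 'Z', 'V'],
    'optional_col3': [None, 'def', 'ghi', 'jkl'],
})
df2 = pd.DataFrame({
    'col1': ['a', 'b', 'c', 'd'],
    'optional_col2': ['X', 'Y', 'Z', 'W'],
    'optional_col3': ['abc', 'def', 'ghi', 'mno'],
})

# Join on col1 only; the optional columns get _x (df1) and _y (df2) suffixes.
merged = df1.merge(df2, on='col1', suffixes=('_x', '_y'))

# A missing value in df1 never equals a df2 value, so it simply fails to match.
match2 = merged['optional_col2_x'].eq(merged['optional_col2_y'])
match3 = merged['optional_col3_x'].eq(merged['optional_col3_y'])
matched = merged[match2 | match3]

# Assumption: keep df2's values for the final three-column frame.
result = (
    matched[['col1', 'optional_col2_y', 'optional_col3_y']]
    .rename(columns={'optional_col2_y': 'optional_col2',
                     'optional_col3_y': 'optional_col3'})
    .reset_index(drop=True)
)
print(result)
```

Unlike the fillna version, this keeps a row that matches on one optional column even when the other holds a different non-null value in df1, which is the situation where you have to choose which side's values to keep.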
Upvotes: 2