Reputation: 6154
I have the same data coming from two sources, both in pandas DataFrames. A few things are happening: source 1 (left) is only a subset of the data from source 2 (right), and source 2 has one additional column. Here is a simple example:
import pandas as pd

df1 = pd.DataFrame(
    {  # last entry a duplicate of the first
        'var1': [1, 2, 3, 1],
        'var2': ['a', 'b', 'c', 'a']
    }
)
df2 = pd.DataFrame(
    {  # last two entries duplicates of the first two
        'cat':  [1, 2, 3, 4, 1, 2],
        'var1': [1, 2, 3, 4, 1, 2],
        'var2': ['a', 'b', 'c', 'd', 'a', 'b']
    }
)
merged = pd.merge(df1, df2, on=["var1", "var2"], how="inner")
merged
results in:
# - indicates undesired duplicates
var1 var2 cat
0 1 a 1
1 1 a 1
2 1 a 1 #
3 1 a 1 #
4 2 b 2
5 2 b 2 #
6 3 c 3
But what I'm expecting is (order doesn't matter):
var1 var2 cat
0 1 a 1
1 1 a 1
2 2 b 2
3 3 c 3
I took a look at dropping duplicates, but drop_duplicates does not discriminate: it also destroys the legitimate duplicates that exist in both sources.
merged.drop_duplicates()
var1 var2 cat
0 1 a 1
4 2 b 2
6 3 c 3
One thought I had was to add two binary columns that identify the source before merging, and then drop any rows where both columns are not true.
var1 var2 cat src1 src2
0 1 a 1 1 1
1 1 a 1 1 1
2 1 a 1 0 1 # drop
3 1 a 1 0 1 # drop
4 2 b 2 1 1
5 2 b 2 0 1 # drop
6 3 c 3 1 1
However, I'm not sure how to achieve such labelling during a merge (if it's even possible).
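The closest built-in I'm aware of is the indicator=True option of pd.merge, which adds a _merge column marking each row as 'left_only', 'right_only', or 'both'. A rough sketch (it flags each row's origin, but does not by itself remove the extra duplicates):

# indicator=True adds a categorical '_merge' column showing whether each
# row came from the left frame only, the right frame only, or both.
flagged = pd.merge(df1, df2, on=['var1', 'var2'], how='outer', indicator=True)
print(flagged)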
So, in short, the question is: "How do I merge two tables, preserving duplicates only when they exist in both tables?"
Upvotes: 1
Views: 49
Reputation: 150735
Try merging with an enumeration of the duplicate keys:
key_cols = ['var1', 'var2']
# Number repeated (var1, var2) keys within each frame, then require the
# enumeration to match as well, so each copy pairs with exactly one copy.
(df1.assign(enum=df1.groupby(key_cols).cumcount())
    .merge(df2.assign(enum=df2.groupby(key_cols).cumcount()),
           on=['enum'] + key_cols, how='inner')
    .drop('enum', axis=1)
)
Output:
var1 var2 cat
0 1 a 1
1 2 b 2
2 3 c 3
3 1 a 1
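Roughly why this works: groupby(key_cols).cumcount() numbers repeated (var1, var2) keys within each frame (0, 1, 2, ...), so the k-th copy of a key in df1 can only pair with the k-th copy in df2, and each key survives min(count in df1, count in df2) times. A quick look at the intermediate enumeration, using the df1/df2 from the question:

# Show the enum column each frame gets before the merge; matching on it
# makes the join one-to-one per duplicated key.
key_cols = ['var1', 'var2']
print(df1.assign(enum=df1.groupby(key_cols).cumcount()))
print(df2.assign(enum=df2.groupby(key_cols).cumcount()))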
Upvotes: 1