Reputation: 4274
I have below dataframes
df1 = pd.DataFrame({
'contact_id': [1,3,4,5,-1],
'subscription_id': ['AAA', 'ccc', 'ddd', 'eee', 'fff']
});
print(df1)
contact_id subscription_id
0 1 AAA
1 3 ccc
2 4 ddd
3 5 eee
4 -1 fff
2nd dataframe
df2 = pd.DataFrame({
'contact_id': [1,2,-1],
'subscription_id': ['AAA', 'bbb', 'fff'],
'extra': ['we', 'kl', 'op']
});
print(df2)
contact_id subscription_id extra
0 1 AAA we
1 2 bbb kl
2 -1 fff op
Expected Output
contact_id subscription_id extra
1 3 ccc NaN
2 4 ddd NaN
3 5 eee NaN
My Code
import pandas as pd
df1 = pd.DataFrame({
'contact_id': [1,3,4,5,-1],
'subscription_id': ['AAA', 'ccc', 'ddd', 'eee', 'fff']
});
print(df1)
df2 = pd.DataFrame({
'contact_id': [1,2,-1],
'subscription_id': ['AAA', 'bbb', 'fff'],
'extra': ['we', 'kl', 'op']
});
print(df2)
sub = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
print(sub)
Can anyone guide me where i am doing wrong?
Upvotes: 1
Views: 75
Reputation: 30070
sub = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
Problems in your code
df2
twice. (Though this doesn't matter since you then drop duplicates.)subset
argument of pandas.DataFrame.drop_duplicates
, by default pandas will use all of the columns to identify duplicates.Since extra
column is unnecessary, you can do with boolean indexing
df1 = df1.loc[~((df1['contact_id'].isin(df2['contact_id']))&(df1['subscription_id'].isin(df2['subscription_id'])))]
# print(df1)
contact_id subscription_id
1 3 ccc
2 4 ddd
3 5 eee
Upvotes: 0
Reputation: 34086
What you want is basically result of Left join
minus result of Inner Join
. This looks like a typical case of merge
not pd.concat
.
Use df.merge
with Left
join and indicator
column as True
. Pick rows which are present in df1
only by choosing left_only
:
In [1586]: df1.merge(df2, how='left', indicator=True).query('_merge == "left_only"').drop('_merge', 1)
Out[1586]:
contact_id subscription_id extra
1 3 ccc NaN
2 4 ddd NaN
3 5 eee NaN
Upvotes: 3