Sangram Badi
Sangram Badi

Reputation: 4274

How to subtract a dataframe from a dataframe based on columns?

I have below dataframes

df1 = pd.DataFrame({
    'contact_id': [1,3,4,5,-1],
    'subscription_id': ['AAA', 'ccc', 'ddd', 'eee', 'fff']
});

print(df1)

   contact_id subscription_id
0           1             AAA
1           3             ccc
2           4             ddd
3           5             eee
4          -1             fff

2nd dataframe

df2 = pd.DataFrame({
    'contact_id': [1,2,-1],
    'subscription_id': ['AAA', 'bbb', 'fff'],
    'extra': ['we', 'kl', 'op']
});

print(df2)

   contact_id subscription_id extra
0           1             AAA    we
1           2             bbb    kl
2          -1             fff    op

Expected Output

   contact_id subscription_id extra
1           3             ccc   NaN
2           4             ddd   NaN
3           5             eee   NaN

My Code

import pandas as pd

df1 = pd.DataFrame({
    'contact_id': [1,3,4,5,-1],
    'subscription_id': ['AAA', 'ccc', 'ddd', 'eee', 'fff']
});

print(df1)

df2 = pd.DataFrame({
    'contact_id': [1,2,-1],
    'subscription_id': ['AAA', 'bbb', 'fff'],
    'extra': ['we', 'kl', 'op']
});

print(df2)

sub = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)
print(sub)

Can anyone guide me where i am doing wrong?

Upvotes: 1

Views: 75

Answers (2)

Ynjxsjmh
Ynjxsjmh

Reputation: 30070

sub = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

Problems in your code

  1. You are concatting df2 twice. (Though this doesn't matter since you then drop duplicates.)
  2. If no set subset argument of pandas.DataFrame.drop_duplicates, by default pandas will use all of the columns to identify duplicates.

Since extra column is unnecessary, you can do with boolean indexing

df1 = df1.loc[~((df1['contact_id'].isin(df2['contact_id']))&(df1['subscription_id'].isin(df2['subscription_id'])))]
# print(df1)

   contact_id subscription_id
1           3             ccc
2           4             ddd
3           5             eee

Upvotes: 0

Mayank Porwal
Mayank Porwal

Reputation: 34086

What you want is basically result of Left join minus result of Inner Join. This looks like a typical case of merge not pd.concat.

Use df.merge with Left join and indicator column as True. Pick rows which are present in df1 only by choosing left_only:

In [1586]: df1.merge(df2, how='left', indicator=True).query('_merge == "left_only"').drop('_merge', 1)
Out[1586]: 
   contact_id subscription_id extra
1           3             ccc   NaN
2           4             ddd   NaN
3           5             eee   NaN

Upvotes: 3

Related Questions