Reputation: 163
So, I have two datasets (I mean dataframes) as follows.
dataframe 1:
name, age, id, acctno
abc, 23, 1001, 238238
dhd, 22, 2001, 299299
ddg, 30, 2920, 101010
ssd, 53, 1901, 238003
ggh, 52, 2221, 222222
eet, 50, 9920, 111111
(we can save this as population1.csv)
dataframe 2:
name, age, id, acctno
abc, 11, 1001, 238238
def, 55, 2001, 299299
xxy, 90, 2020, 101010
(we can save this as population2.csv)
So, we can read the data as follows
df1 = pd.read_csv('population1.csv')
df2 = pd.read_csv('population2.csv')
And I want to get the following result:
res = df1 - df2
based on the common columns id and acctno.
We can see that, based on id and acctno, dataframe2 is in dataframe1, but dataframe1 has some more records that are not common with dataframe2.
Subsetting two dataframes based on one column is straightforward; however, I would like to know how to subset two dataframes based on two columns.
So, the result should be something like the following:
ssd, 53, 1901, 238003
ggh, 52, 2221, 222222
eet, 50, 9920, 111111
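For intuition, what is being asked for here is a set difference on the (id, acctno) key pairs. Below is a minimal sketch of just that idea with plain Python sets (the answers below turn it into actual row selection). Note that ddg's key (2920, 101010) also survives, because df2's third row has id 2020 rather than 2920, which is why the answers' outputs below contain four rows:
keys1 = set(zip(df1['id'], df1['acctno']))
keys2 = set(zip(df2['id'], df2['acctno']))
keys1 - keys2
# {(1901, 238003), (2221, 222222), (2920, 101010), (9920, 111111)}  (set order may vary)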
Upvotes: 3
Views: 2048
Reputation: 51155
A solution using concat, where we concat df1, df2, then df2 again, so that all rows from df2 are guaranteed to be dropped (this may be memory inefficient if your DataFrames are very large):
pd.concat([df1, df2, df2]).drop_duplicates(['id', 'acctno'], keep=False)
name age id acctno
2 ddg 30 2920 101010
3 ssd 53 1901 238003
4 ggh 52 2221 222222
5 eet 50 9920 111111
If you have duplicates in df1 that you wish to keep, this answer will ignore your wishes and drop those too.
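A duplicate-safe sketch, assuming you do want to keep any repeated rows in df1: filter df1 directly on its (id, acctno) keys instead of concatenating and dropping duplicates (MultiIndex.from_frame requires pandas 0.24+):
keys = ['id', 'acctno']
df2_keys = list(df2[keys].itertuples(index=False, name=None))
df1[~pd.MultiIndex.from_frame(df1[keys]).isin(df2_keys)]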
Upvotes: 3
Reputation: 323266
isin with apply and tuple:
df1.loc[~df1[['id', 'acctno']].apply(tuple, axis=1).isin(df2[['id', 'acctno']].apply(tuple, axis=1))]
Out[215]:
name age id acctno
2 ddg 30 2920 101010
3 ssd 53 1901 238003
4 ggh 52 2221 222222
5 eet 50 9920 111111
Upvotes: 3
Reputation: 294278
from io import StringIO

def rpd(text='', sep=r'\s{1,}', *args, **kwargs):
    # Helper: read a DataFrame from an inline string instead of a file.
    kw = dict(engine='python', sep=sep)
    return pd.read_csv(StringIO(text), *args, **kw, **kwargs)
df1 = rpd(sep=r',\s*', text="""\
name, age, id, acctno
abc, 23, 1001, 238238
dhd, 22, 2001, 299299
ddg, 30, 2920, 101010
ssd, 53, 1901, 238003
ggh, 52, 2221, 222222
eet, 50, 9920, 111111""")
df2 = rpd(sep=r',\s*', text="""\
name, age, id, acctno
abc, 11, 1001, 238238
def, 55, 2001, 299299
xxy, 90, 2020, 101010""")
mask
df2_tups = [*zip(df2.id, df2.acctno)]
mask = [t not in df2_tups for t in zip(df1.id, df1.acctno)]
df1[mask]
name age id acctno
2 ddg 30 2920 101010
3 ssd 53 1901 238003
4 ggh 52 2221 222222
5 eet 50 9920 111111
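A small follow-up on the mask variant: membership tests against a list are linear per row, so for a large df2 a set is the cheaper container (same result, just a tweak on the above):
df2_tups = set(zip(df2.id, df2.acctno))
mask = [t not in df2_tups for t in zip(df1.id, df1.acctno)]
df1[mask]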
merge
The merge function/method has an indicator argument that, if set to True, adds a column telling you which of the data sources each set of merging identifiers was found in. In your case, you want to grab the ones that are 'left_only'.
df1.merge(
    df2[['id', 'acctno']], how='left', indicator=True
).query('_merge == "left_only"').drop(columns='_merge')
name age id acctno
2 ddg 30 2920 101010
3 ssd 53 1901 238003
4 ggh 52 2221 222222
5 eet 50 9920 111111
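One possible variant, a sketch with the join keys spelled out explicitly; being explicit guards against the join silently picking up any other column names the two frames happen to share:
(df1.merge(df2[['id', 'acctno']], on=['id', 'acctno'], how='left', indicator=True)
    .query('_merge == "left_only"')
    .drop(columns='_merge'))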
Upvotes: 4