Ariful Haque
Ariful Haque

Reputation: 163

pandas subset two dataframes based on multiple columns

So, I have two datasets(I mean dataframe) as follows dataframe 1:

name, age, id, acctno
abc, 23, 1001, 238238
dhd, 22, 2001, 299299
ddg, 30, 2920, 101010
ssd, 53, 1901, 238003
ggh, 52, 2221, 222222
eet, 50, 9920, 111111

(we can save this as population1.csv)

dataframe 2:

name, age, id, acctno
abc, 11, 1001, 238238
def, 55, 2001, 299299
xxy, 90, 2020, 101010

(we can save this as population2.csv)

So, we can read the data as follows

df1 = pd.read_csv('population1.csv')
df2 = pd.read_csv('population2.csv')

And, I want to get the following result

res = df1-df2

based on the common columns of id and acctno. We can see that based on id and acctno, dataframe2 is in dataframe1. But the dataframe1 has some more records that are not common in dataframe2.

Subsetting two dataframes based on one column is straightforeword, however would like to know how to subset two dataframes based on two columns.

So, the result should as something as follows

ssd, 53, 1901, 238003
ggh, 52, 2221, 222222
eet, 50, 9920, 111111

Upvotes: 3

Views: 2048

Answers (3)

user3483203
user3483203

Reputation: 51155

A solution using concat, where we concat df1, df2, then df2 again so all rows from df2 are guaranteed to be dropped (this may be memory inefficient if your DataFrames are very large):

pd.concat([df1, df2, df2]).drop_duplicates(['id', 'acctno'], keep=False)

  name  age    id  acctno
2  ddg   30  2920  101010
3  ssd   53  1901  238003
4  ggh   52  2221  222222
5  eet   50  9920  111111

If you have duplicates in df1 that you wish to keep, this answer will ignore your wishes.

Upvotes: 3

BENY
BENY

Reputation: 323266

isin with apply tuple

df1.loc[~df1[['id', 'acctno']].apply(tuple,1).isin(df2[['id', 'acctno']].apply(tuple,1))]
Out[215]: 
  name  age    id  acctno
2  ddg   30  2920  101010
3  ssd   53  1901  238003
4  ggh   52  2221  222222
5  eet   50  9920  111111

Upvotes: 3

piRSquared
piRSquared

Reputation: 294278

Setup

def rpd(text='', sep='\s{1,}', *args, **kwargs):
  kw = dict(engine='python', sep=sep)
  return pd.read_csv(pd.io.common.StringIO(text), *args, **kw, **kwargs)

df1 = rpd(sep=',\s*', text="""\
name, age, id, acctno
abc, 23, 1001, 238238
dhd, 22, 2001, 299299
ddg, 30, 2920, 101010
ssd, 53, 1901, 238003
ggh, 52, 2221, 222222
eet, 50, 9920, 111111""")

df2 = rpd(sep=',\s*', text="""\
name, age, id, acctno
abc, 11, 1001, 238238
def, 55, 2001, 299299
xxy, 90, 2020, 101010""")

mask

df2_tups = [*zip(df2.id, df2.acctno)]
mask = [t not in df2_tups for t in zip(df1.id, df1.acctno)]
df1[mask]

  name  age    id  acctno
2  ddg   30  2920  101010
3  ssd   53  1901  238003
4  ggh   52  2221  222222
5  eet   50  9920  111111

merge

The merge function/method has an indicator argument that if set to True adds a column that tells you which of the data sources the merging identifiers were in. In your case, you want to grab the ones that are left only.

df1.merge(
    df2[['id', 'acctno']], how='left', indicator=True
).query('_merge == "left_only"').drop('_merge', 1)

  name  age    id  acctno
2  ddg   30  2920  101010
3  ssd   53  1901  238003
4  ggh   52  2221  222222
5  eet   50  9920  111111

Google Colab

GitHub

Upvotes: 4

Related Questions