sameer
sameer

Reputation: 79

Filter dataframes based on 2 columns of another dataframe in python

I have a DataFrame like this:

data = {'Name':['Tom', 'Jack', 'nick', 'juli', 'Tom', 'nick', 'juli','nick', 'juli','Tom'], 'subject': ['eng', 'maths', 'geo', 'maths', 'science', 'geo', 'maths', 'maths', 'geo', 'science'], 'marks':[99, 98, 95, 90, 99, 98, 97, 95, 96, 98]}
df1 = pd.DataFrame(data)

df1

 Name    subject    marks
0   Tom     eng     99
1   Jack    maths   98
2   nick    geo     95
3   juli    maths   90
4   Tom     science 99
5   nick    geo     98
6   juli    maths   97
7   nick    maths   95
8   juli    geo     96
9   Tom     science 98

another dataframe as :

data2 = {'Name':['Jack', 'nick', 'Tom',  'juli', 'Tom', 'nick','nick', 'juli'], 'subject': ['eng', 'maths', 'geo', 'maths', 'science', 'geo',  'maths', 'geo']}
df2 = pd.DataFrame(data2)


df2

    Name    subject
0   Jack    eng
1   nick    maths
2   Tom     geo
3   juli    maths
4   Tom     science
5   nick    geo
6   nick    maths
7   juli    geo

I want to filter df2 based on combination of 'Names' and 'subject' in df1. If a particular combination of 'Name' and 'subject' in df1 appears more than once and then it is matched in df2. If it matches then we get those rows from df2 as output.

Desired output:

pd.DataFrame({'Names':['Tom', 'juli', 'nick'], 'subject': ['science', 'maths', 'geo']})

    Name    subject 
0   nick    geo 
1   juli    maths
2   Tom     science

can anyone help without using 'merge' option?

Upvotes: 1

Views: 48

Answers (1)

jezrael
jezrael

Reputation: 862581

I believe you need filter rows with duplicated values only by DataFrame.duplicated with keep=False chained without this parameter and for them first rows and then use merge for inner join:

df11 = df1[df1.duplicated(subset=['Name','subject'], keep=False) & 
           df1.duplicated(subset=['Name','subject'])]

df3 = df11.merge(df2, suffixes=('_',''))[df2.columns]
print (df3)
   Name  subject
0  nick      geo
1  juli    maths
2   Tom  science

Another similar idea is filter columns by df2 in merge:

cols = df2.columns
df11 = df1.loc[df1[cols].duplicated(keep=False) & df1[cols].duplicated(), cols]
df3 = df11.merge(df2)
print (df3)
   Name  subject
0  nick      geo
1  juli    maths
2   Tom  science
    

Upvotes: 1

Related Questions