eshfaq ahmad
eshfaq ahmad

Reputation: 123

Comparing rows in 2 dataframes and counting number of similar columns

I have two dataframes

>> df1
ID  Hair  Legs  Feathers
 1    1     0      0
 2    1     2      1
 3    0     2      1

>> df2
ID  Hair  Legs  Feathers
21   1      2     0
22   1      0     1

I want to compare each row in df2 with all the rows in df1 and count the number of columns that are similar in each row of df2 in such a manner that it gives the following dataframe df3

>> df3
ID    Hair  Legs  Feathers  Count
1-21   1      2      0        2
2-21   1      2      0        2
3-12   1      2      0        1
1-22   1      0      1        2
2-22   1      0      1        2
3-22   1      0      1        1

The Count is calculated in such a manner that the first row of df2 is compared with the first row of df1 and the number of similar columns are calculated. Similarly between the first row of df2 with second row of df1 and so on. Moreover, the second row of df2 is compared with all the rows of df1 one by one and stored in another dataframe df3.

Any help will be highly appreciated

Upvotes: 3

Views: 143

Answers (1)

jezrael
jezrael

Reputation: 862511

I believe you need:

#cross join between both DataFrames
df = df2.assign(A=1).merge(df1.assign(A=1), on='A', suffixes=('','_')).drop('A', axis=1)
#join ID columns and set index
df.index = df.pop('ID_').astype(str) + '_' + df.pop('ID').astype(str)
df.index.name='ID'
print (df)
      Hair  Legs  Feathers  Hair_  Legs_  Feathers_
ID                                                 
1_21     1     2         0      1      0          0
2_21     1     2         0      1      2          1
3_21     1     2         0      0      2          1
1_22     1     0         1      1      0          0
2_22     1     0         1      1      2          1
3_22     1     0         1      0      2          1

cols = df.filter(regex='_$').columns
#compare rows for match and count True values by sum
df['count'] = df[cols.str[:-1]].eq(df[cols].rename(columns=lambda x: x[:-1])).sum(axis=1)

df = df.drop(cols, axis=1).reset_index()
print (df)
     ID  Hair  Legs  Feathers  count
0  1_21     1     2         0      2
1  2_21     1     2         0      2
2  3_21     1     2         0      1
3  1_22     1     0         1      2
4  2_22     1     0         1      2
5  3_22     1     0         1      1

Upvotes: 2

Related Questions