Reputation: 123
I have two dataframes
>> df1
ID Hair Legs Feathers
1 1 0 0
2 1 2 1
3 0 2 1
>> df2
ID Hair Legs Feathers
21 1 2 0
22 1 0 1
I want to compare each row in df2
with all the rows in df1
and count the number of columns that are similar in each row of df2
in such a manner that it gives the following dataframe df3
>> df3
ID Hair Legs Feathers Count
1-21 1 2 0 2
2-21 1 2 0 2
3-12 1 2 0 1
1-22 1 0 1 2
2-22 1 0 1 2
3-22 1 0 1 1
The Count
is calculated in such a manner that the first row of df2
is compared with the first row of df1
and the number of similar columns are calculated. Similarly between the first row of df2
with second row of df1
and so on. Moreover, the second row of df2
is compared with all the rows of df1
one by one and stored in another dataframe df3
.
Any help will be highly appreciated
Upvotes: 3
Views: 143
Reputation: 862511
I believe you need:
#cross join between both DataFrames
df = df2.assign(A=1).merge(df1.assign(A=1), on='A', suffixes=('','_')).drop('A', axis=1)
#join ID columns and set index
df.index = df.pop('ID_').astype(str) + '_' + df.pop('ID').astype(str)
df.index.name='ID'
print (df)
Hair Legs Feathers Hair_ Legs_ Feathers_
ID
1_21 1 2 0 1 0 0
2_21 1 2 0 1 2 1
3_21 1 2 0 0 2 1
1_22 1 0 1 1 0 0
2_22 1 0 1 1 2 1
3_22 1 0 1 0 2 1
cols = df.filter(regex='_$').columns
#compare rows for match and count True values by sum
df['count'] = df[cols.str[:-1]].eq(df[cols].rename(columns=lambda x: x[:-1])).sum(axis=1)
df = df.drop(cols, axis=1).reset_index()
print (df)
ID Hair Legs Feathers count
0 1_21 1 2 0 2
1 2_21 1 2 0 2
2 3_21 1 2 0 1
3 1_22 1 0 1 2
4 2_22 1 0 1 2
5 3_22 1 0 1 1
Upvotes: 2