Reputation: 43
I have two dataframes (read from csvs) with identical columns but likely different rows. I'm trying to produce a third data frame that has the index on the left and a TRUE or FALSE value for each column where there is a match for a given index record. Here is a simplistic example
df1 = pd.DataFrame(np.array([
['100', 'a', 1, 'aa'],
['101', 'b', 2, 'bb'],
['102', 'c', 3, 'cc']]),
columns=['ID', 'Col1', 'Col2', 'Col3']).set_index('ID')
df2 = pd.DataFrame(np.array([
['100', 'a', 1, 'aa'],
['101', 'b', 2, 'bb'],
['102', 'c', 3, 'cb']]),
columns=['ID', 'Col1', 'Col2', 'Col3']).set_index('ID')
df3 = pd.DataFrame(np.where(df1==df2,True,False),
columns=df1.columns,
index=df1.index)
print(df3)
This produces:
ID
100 True True True
101 True True True
102 True True False
The issue I have is that the number of records will differ and could be out of order. There could be a row with ID 104 in df1 and a row with ID 105 in df2. This causes a "Can only compare identically-labeled DataFrame objects" error to be thrown.
I'm thinking np.where isn't flexible enough for this? Any advice would be greatly appreciated!
Upvotes: 2
Views: 2187
Reputation: 323236
In that case you may need to use eq
:
df1.eq(df2)
Col1 Col2 Col3
ID
100 True True True
101 True True True
102 False False False
105 False False False
Upvotes: 4