Reputation: 7174
I just ran into some weird behaviour when comparing the values of two pandas DataFrames using pd.DataFrame.equals():
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df2 = df1.copy()
df1.equals(df2)
# True (obviously)
However, when I change the column type to a different integer format, they will not be considered equal anymore:
df1['a'] = df1['a'].astype(np.int32)
df1.equals(df2)
# False
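To make the cause visible, here is a minimal sketch that compares the values and the dtypes separately. The explicit dtype='int64' is my own assumption, added so the result does not depend on the platform's default integer width; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: force int64 so the example behaves the same on every platform.
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, dtype='int64')
df2 = df1.copy()
df1['a'] = df1['a'].astype(np.int32)

# The values still match element by element ...
values_match = bool((df1 == df2).all().all())

# ... but the column dtypes no longer do, and equals() checks those as well.
dtypes_match = bool((df1.dtypes == df2.dtypes).all())
strict_equal = df1.equals(df2)

print(df1.dtypes)
print(df2.dtypes)
print(values_match, dtypes_match, strict_equal)
```

Printing both dtypes listings side by side shows that only column 'a' differs (int32 versus int64).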
In the .equals() documentation, they point out that the variables must have the same type, and present an example comparing floats to integers, which doesn't work. I didn't expect this to extend to different types of integers, too.
When doing the same comparison using ==, it does return True:
(df1 == df2).all().all()
# True
However, == doesn't assess two missing values as equal to each other.
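A small sketch of that difference, using two illustrative frames (left and right are names I made up for this example) that hold a NaN in the same position:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'a': [1.0, np.nan, 3.0]})
right = left.copy()

# Element-wise ==: NaN == NaN evaluates to False, so the overall check fails.
elementwise_equal = bool((left == right).all().all())

# equals(): NaNs in the same location are treated as equal.
equals_result = left.equals(right)

print(elementwise_equal, equals_result)
```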
Is there an elegant way to handle missing values as equal, whilst not enforcing the same integer type? The best I can come up with is:
(df1.fillna(0) == df2.fillna(0)).all().all()
but there has to be a more concise and less hacky way to deal with this problem.
My follow-up, opinion-based question: would you consider this a bug?
Upvotes: 8
Views: 2214
Reputation: 799
If you think of this as a decimal problem (i.e. does 2 equal 2), then this perhaps looks like a bug. However, if you look at it from how the interpreter sees it (i.e. does 00000010 equal 0000000000000010), then it becomes plain that there is indeed a difference: bit for bit, the two representations are not the same width.
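As a rough illustration of the two viewpoints (plain NumPy scalars here, not pandas internals), the values compare equal while the types and their storage widths do not:

```python
import numpy as np

small = np.int32(2)
large = np.int64(2)

# "Decimal" view: 2 equals 2.
same_value = bool(small == large)

# "Storage" view: the dtypes differ, and so does the number of bytes used.
same_dtype = small.dtype == large.dtype
widths = (small.dtype.itemsize, large.dtype.itemsize)

print(same_value, same_dtype, widths)
```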
From a validation perspective, it is probably a good idea to make sure you are comparing apples to apples and so I like the answer of @Ben.T:
df1.equals(df2.astype(df1.dtypes))
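Here is a minimal sketch of that approach on made-up data that combines both issues from the question: a differing integer width in column 'a' and a missing value in column 'b'. The frames and variable names are illustrative assumptions, not taken from the question.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, np.nan, 6.0]})
df1['a'] = df1['a'].astype(np.int32)

df2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, np.nan, 6.0]})
df2['a'] = df2['a'].astype(np.int64)

# Strict comparison fails because column 'a' is int32 in df1 and int64 in df2.
strict = df1.equals(df2)

# Casting df2 to df1's dtypes first removes the dtype mismatch;
# equals() still treats NaNs in the same location as equal.
aligned = df1.equals(df2.astype(df1.dtypes))

print(strict, aligned)
```

One caveat: casting to a narrower integer type is only safe when the values actually fit in it, so it may be worth casting towards the wider of the two types if you are unsure.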
Is this a bug? That is above my pay grade. You can submit it, and the thinkers surrounding the pandas library can make a decision. It does seem odd that the '==' operator gives different results than the '.equals' function, and that may sway the decision.
Upvotes: 3