Mahdi

Reputation: 235

Pandas df.equals() returning False on identical dataframes?

Let df_1 and df_2 be:

In [1]: import pandas as pd
   ...: df_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
   ...: df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

In [2]: df_1
Out[2]:
   a  b
0  1  4
1  2  5
2  3  6

We add a row r to df_1:

In [3]: r = pd.DataFrame({'a': ['x'], 'b': ['y']})
   ...: df_1 = pd.concat([df_1, r], ignore_index=True)  # DataFrame.append was removed in pandas 2.0

In [4]: df_1
Out[4]:
   a  b
0  1  4
1  2  5
2  3  6
3  x  y

We now remove the added row from df_1 and get the original df_1 back again:

In [5]: df_1 = pd.concat([df_1, r]).drop_duplicates(keep=False)

In [6]: df_1
Out[6]:
   a  b
0  1  4
1  2  5
2  3  6

In [7]: df_2
Out[7]:
   a  b
0  1  4
1  2  5
2  3  6

While df_1 and df_2 are identical, equals() returns False.

In [8]: df_1.equals(df_2)
Out[8]: False

I searched SO but could not find a related question. Am I doing something wrong? How can I get the correct result in this case? (df_1 == df_2).all().all() returns True, but it is not suitable when df_1 and df_2 have different lengths.

Upvotes: 6

Views: 13593

Answers (4)

smci

Reputation: 33940

Use pandas.testing.assert_frame_equal(df_1, df_2, check_dtype=True), which will also check if the dtypes are the same.

(In this case it picks up that your dtypes changed from int64 to object when you appended, then deleted, a string row; pandas does not automatically coerce the dtype back down to a narrower one.)

AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="a") are different

Attribute "dtype" are different
[left]:  object
[right]: int64
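
To list the columns whose dtypes differ without triggering an assertion, one option is to compare the dtypes column by column. This is only a minimal sketch: dtype_mismatches is a hypothetical helper name (not a pandas function), and astype(object) is used as a stand-in for the frame left after the append/remove round trip.

```python
import pandas as pd


def dtype_mismatches(left, right):
    """Map each shared column to its (left dtype, right dtype) pair when they differ."""
    return {
        col: (left[col].dtype, right[col].dtype)
        for col in left.columns
        if col in right.columns and left[col].dtype != right[col].dtype
    }


df_int = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
# Stand-in for the round-tripped frame: same values, stored as object dtype.
df_obj = df_int.astype(object)

# Reports both columns, each as an (object, int64) pair.
print(dtype_mismatches(df_obj, df_int))
```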

Upvotes: 7

Mahdi

Reputation: 235

Based on the comments of the others, in this case one can do:

from pandas.testing import assert_frame_equal  # pandas.util.testing is deprecated

identical_df = True
try:
    assert_frame_equal(df_1, df_2, check_dtype=False)
except AssertionError:
    identical_df = False
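
The same idea can be wrapped in a small function so it is reusable. This is just a sketch: frames_match is a made-up name, and it assumes assert_frame_equal is imported from pandas.testing. A shape mismatch also raises AssertionError, so frames of different length simply come back as not equal.

```python
import pandas as pd
from pandas.testing import assert_frame_equal


def frames_match(left, right, check_dtype=False):
    """Return True if assert_frame_equal passes, False if it raises AssertionError."""
    try:
        assert_frame_equal(left, right, check_dtype=check_dtype)
    except AssertionError:
        return False
    return True


df_int = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df_obj = df_int.astype(object)   # same values, object dtype
df_short = df_int.iloc[:2]       # one row fewer

print(frames_match(df_obj, df_int))                    # dtypes ignored
print(frames_match(df_obj, df_int, check_dtype=True))  # dtypes compared
print(frames_match(df_short, df_int))                  # shapes differ
```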

Upvotes: 1

Mayank Porwal

Reputation: 34046

As per df.equals docs:

This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal. The column headers do not need to have the same type, but the elements within the columns must be the same dtype.

So df.equals returns True only when the elements have the same values and the dtypes are also the same.

When you add and then delete the row from df_1, the dtypes change from int64 to object, hence it returns False.

Explanation with your example:

In [1028]: df_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

In [1029]: df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

In [1031]: df_1.dtypes
Out[1031]: 
a    int64
b    int64
dtype: object

In [1032]: df_2.dtypes
Out[1032]: 
a    int64
b    int64
dtype: object

As you can see above, the dtypes of both DataFrames are the same, hence the condition below returns True:

In [1030]: df_1.equals(df_2)
Out[1030]: True

Now after you add and remove the row:

In [1033]: r = pd.DataFrame({'a': ['x'], 'b': ['y']})

In [1034]: df_1 = pd.concat([df_1, r], ignore_index=True)

In [1036]: df_1 = pd.concat([df_1, r]).drop_duplicates(keep=False)

In [1038]: df_1.dtypes
Out[1038]: 
a    object
b    object
dtype: object

The dtype has changed to object, hence the condition below returns False:

In [1039]: df_1.equals(df_2)
Out[1039]: False

If you still want it to return True, you need to change the dtypes back to int:

In [1042]: df_1 = df_1.astype(int)
In [1044]: df_1.equals(df_2)
Out[1044]: True
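
If the target dtype is not known in advance, DataFrame.infer_objects() is a possible alternative to astype(int): it asks pandas to pick a more specific dtype for each object column. A small sketch, using astype(object) as a stand-in for the frame left over after adding and removing the string row:

```python
import pandas as pd

df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
# Stand-in for df_1 after the round trip: integer values stored as object.
df_1 = df_2.astype(object)

print(df_1.equals(df_2))      # dtypes differ at this point

df_1 = df_1.infer_objects()   # soft-convert object columns that hold only ints
print(df_1.dtypes)
print(df_1.equals(df_2))
```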

Upvotes: 3

Paul Brennan

Reputation: 2696

This again is a subtle one, well done for spotting it.

import pandas as pd
df_1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df_2 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
r = pd.DataFrame({'a': ['x'], 'b': ['y']})
df_1 = pd.concat([df_1, r], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df_1 = pd.concat([df_1, r]).drop_duplicates(keep=False)
df_1.equals(df_2)

from pandas.testing import assert_frame_equal  # pandas.util.testing is deprecated
assert_frame_equal(df_1,df_2)

Now we can see the issue, as the assert fails:

AssertionError: Attributes of DataFrame.iloc[:, 0] (column name="a") are different

Attribute "dtype" are different
[left]:  object
[right]: int64

As you added strings to the integer columns, the integers became objects. This is why equals fails as well.

Upvotes: 10
