Evaluating equality of sorted pandas dataframes does not behave as expected

Question

I would like to compare two pandas dataframes for equality:

import pandas as pd

foo = pd.DataFrame([['between', 1.5], ['between', 2],
                    ['between', 2.0], ['within', 2.0]],
                   columns=['Group', 'Distance'])

bar = pd.DataFrame([['between', 2], ['between', 1.5],
                    ['within', 2.0], ['between', 2.0]],
                   columns=['Group', 'Distance'])

As far as I am concerned, these two dataframes are identical. However, I realize pandas does not agree because the rows are not in the same order. My thought was that I could sort and then reset the index:

foo = foo.sort_values('Distance').reset_index(drop=True)
bar = bar.sort_values('Distance').reset_index(drop=True)

Sorting on Distance alone gives different results for the two dataframes: rows with the same Distance are tied, so their relative order depends on the initial ordering of each dataframe. As a result, they do not evaluate as equal:

foo.equals(bar)
False
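
The mismatch is easier to see by looking at the Group column after sorting on Distance alone. This is a minimal sketch, assuming a stable sort (kind='mergesort') so that the order of tied rows is reproducible; the names foo_sorted and bar_sorted are illustrative only:

```python
import pandas as pd

foo = pd.DataFrame([['between', 1.5], ['between', 2],
                    ['between', 2.0], ['within', 2.0]],
                   columns=['Group', 'Distance'])
bar = pd.DataFrame([['between', 2], ['between', 1.5],
                    ['within', 2.0], ['between', 2.0]],
                   columns=['Group', 'Distance'])

# Assumption: kind='mergesort' is used here because it is a stable sort,
# so rows with equal Distance keep their original relative order.
foo_sorted = foo.sort_values('Distance', kind='mergesort').reset_index(drop=True)
bar_sorted = bar.sort_values('Distance', kind='mergesort').reset_index(drop=True)

# The three rows with Distance == 2.0 end up in a different order in each frame.
print(foo_sorted['Group'].tolist())  # ['between', 'between', 'between', 'within']
print(bar_sorted['Group'].tolist())  # ['between', 'between', 'within', 'between']
```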

I could first sort on Group and then on Distance, and this would return True. However, when dealing with larger dataframes, I'm concerned about having to explicitly define sorting rules each time. Is there a better way of comparing two differently ordered dataframes?

zipa · Accepted Answer

Sorting both dataframes on all of their columns makes them evaluate as equal:

cols = foo.columns.values.tolist()
foo.sort_values(cols).reset_index(drop=True).equals(
    bar.sort_values(cols).reset_index(drop=True))

Or, in separate steps:

foo = foo.sort_values(foo.columns.values.tolist()).reset_index(drop=True)
bar = bar.sort_values(foo.columns.values.tolist()).reset_index(drop=True)
foo.equals(bar)
True
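
If this comparison is needed repeatedly, the same idea could be wrapped in a small helper so the sort columns never have to be spelled out. The following is only a sketch: the function name frames_equal_ignoring_row_order is hypothetical, and it assumes both dataframes have the same columns in the same order and that the values in every column can be sorted. Note that DataFrame.equals also compares dtypes, so an int column and a float column holding the same numbers would not match.

```python
import pandas as pd

def frames_equal_ignoring_row_order(left, right):
    """Hypothetical helper: compare two dataframes while ignoring row order."""
    # Assumption: differing column labels (or column order) count as "not equal".
    if list(left.columns) != list(right.columns):
        return False
    cols = list(left.columns)
    # Sorting on every column leaves no ties between non-identical rows,
    # so the resulting row order no longer depends on the original order.
    left_sorted = left.sort_values(cols).reset_index(drop=True)
    right_sorted = right.sort_values(cols).reset_index(drop=True)
    return left_sorted.equals(right_sorted)

foo = pd.DataFrame([['between', 1.5], ['between', 2],
                    ['between', 2.0], ['within', 2.0]],
                   columns=['Group', 'Distance'])
bar = pd.DataFrame([['between', 2], ['between', 1.5],
                    ['within', 2.0], ['between', 2.0]],
                   columns=['Group', 'Distance'])

print(frames_equal_ignoring_row_order(foo, bar))  # True
```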
