johnchase

Reputation: 13715

Evaluating equality of sorted pandas dataframes does not behave as expected

I would like to compare two pandas DataFrames for equality:

import pandas as pd

foo = pd.DataFrame([['between', 1.5], ['between', 2],
                    ['between', 2.0], ['within', 2.0]],
                   columns=['Group', 'Distance'])

bar = pd.DataFrame([['between', 2], ['between', 1.5], 
                    ['within', 2.0], ['between', 2.0]], 
                   columns=['Group', 'Distance'])

As far as I am concerned, these two dataframes are identical. I realize pandas does not agree because the rows are not in the same order, so my thought was to sort and then reset the index:

foo = foo.sort_values('Distance').reset_index(drop=True)
bar = bar.sort_values('Distance').reset_index(drop=True)

Sorting on Distance alone leaves the tied rows in an order that depends on the initial ordering of each dataframe, so the two results differ and do not evaluate as equal:

foo.equals(bar)
False
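To make the cause concrete, here is a minimal sketch that repeats the sort with kind='mergesort', chosen here (it is not in the original code) only because it is a stable algorithm and so gives a predictable result. A stable sort keeps tied rows in their original relative order, and that order differs between the two frames.

```python
import pandas as pd

foo = pd.DataFrame([['between', 1.5], ['between', 2],
                    ['between', 2.0], ['within', 2.0]],
                   columns=['Group', 'Distance'])
bar = pd.DataFrame([['between', 2], ['between', 1.5],
                    ['within', 2.0], ['between', 2.0]],
                   columns=['Group', 'Distance'])

# Stable sort on Distance only: the three rows with Distance == 2.0 are ties,
# so each frame keeps them in the order they were originally written.
foo_sorted = foo.sort_values('Distance', kind='mergesort').reset_index(drop=True)
bar_sorted = bar.sort_values('Distance', kind='mergesort').reset_index(drop=True)

print(foo_sorted['Group'].tolist())  # ['between', 'between', 'between', 'within']
print(bar_sorted['Group'].tolist())  # ['between', 'between', 'within', 'between']
print(foo_sorted.equals(bar_sorted))  # False
```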

I could sort first on Group and then on Distance, and this would return True. However, in dealing with larger dataframes I'm concerned about having to define sorting rules explicitly each time. Is there a better way of comparing two differently ordered dataframes?

Upvotes: 1

Views: 100

Answers (1)

zipa

Reputation: 27879

Sort both dataframes on all of their columns, so that no tied rows are left in an arbitrary order, and they evaluate to True:

cols = foo.columns.values.tolist()
foo.sort_values(cols).reset_index(drop=True).equals(bar.sort_values(cols).reset_index(drop=True))

Or

foo = foo.sort_values(foo.columns.values.tolist()).reset_index(drop=True)
bar = bar.sort_values(foo.columns.values.tolist()).reset_index(drop=True)
foo.equals(bar)
True
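If this comparison is needed in several places, the same idea could be wrapped in a small function. The sketch below is only one possible way to do it: the name equal_ignoring_row_order is hypothetical, and it assumes column names are unique and that the values in each column can be sorted against each other.

```python
import pandas as pd


def equal_ignoring_row_order(left, right):
    """Hypothetical helper: True if both frames hold the same rows in any order."""
    cols = list(left.columns)
    # Frames with different column sets cannot match.
    if set(cols) != set(right.columns):
        return False
    # Sorting on every column leaves no ties that could end up ordered differently.
    left_sorted = left.sort_values(cols).reset_index(drop=True)
    # right[cols] also aligns the column order with the left frame.
    right_sorted = right[cols].sort_values(cols).reset_index(drop=True)
    return left_sorted.equals(right_sorted)


foo = pd.DataFrame([['between', 1.5], ['between', 2],
                    ['between', 2.0], ['within', 2.0]],
                   columns=['Group', 'Distance'])
bar = pd.DataFrame([['between', 2], ['between', 1.5],
                    ['within', 2.0], ['between', 2.0]],
                   columns=['Group', 'Distance'])

print(equal_ignoring_row_order(foo, bar))  # True
```

Because equals also compares dtypes, two frames holding the same values with different column dtypes would still evaluate as unequal.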

Upvotes: 2
