Reputation: 89395
To confirm that I understand what Pandas df.groupby() and df.reset_index() do, I attempted a round trip from a dataframe to a grouped version of the same data and back. After the round trip the columns and rows had to be sorted again, because groupby() affects row order and reset_index() affects column order, but after two quick maneuvers to put the columns and index back in order, the dataframes look identical.
Yet, after all of these checks succeed, df1.equals(df5) returns the astounding value False.
What difference between these dataframes is equals() uncovering that I have not yet figured out how to check for myself?
Test code:
csv_text = """\
Title,Year,Director
North by Northwest,1959,Alfred Hitchcock
Notorious,1946,Alfred Hitchcock
The Philadelphia Story,1940,George Cukor
To Catch a Thief,1955,Alfred Hitchcock
His Girl Friday,1940,Howard Hawks
"""
import io
import pandas as pd
# Parse the inline CSV text above rather than a separate file on disk.
df1 = pd.read_csv(io.StringIO(csv_text))
df1.columns = map(str.lower, df1.columns)
print(df1)
df2 = df1.groupby(['director', df1.index]).first()
df3 = df2.reset_index('director')
df4 = df3[['title', 'year', 'director']]
df5 = df4.sort_index()
print(df5)
print()
print(repr(df1.columns))
print(repr(df5.columns))
print()
print(df1.dtypes)
print(df5.dtypes)
print()
print(df1 == df5)
print()
print(df1.index == df5.index)
print()
print(df1.equals(df5))
The output that I receive when I run the script is:
                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks
                    title  year          director
0      North by Northwest  1959  Alfred Hitchcock
1               Notorious  1946  Alfred Hitchcock
2  The Philadelphia Story  1940      George Cukor
3        To Catch a Thief  1955  Alfred Hitchcock
4         His Girl Friday  1940      Howard Hawks

Index(['title', 'year', 'director'], dtype='object')
Index(['title', 'year', 'director'], dtype='object')

title       object
year         int64
director    object
dtype: object
title       object
year         int64
director    object
dtype: object

  title  year director
0  True  True     True
1  True  True     True
2  True  True     True
3  True  True     True
4  True  True     True

[ True  True  True  True  True]

False
Thanks for any help!
Upvotes: 9
Views: 6525
Reputation: 353019
This feels like a bug to me, but could be simply that I'm misunderstanding something. The blocks are listed in a different order:
>>> df1._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
>>> df5._data
BlockManager
Items: Index(['title', 'year', 'director'], dtype='object')
Axis 1: Int64Index([0, 1, 2, 3, 4], dtype='int64')
ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object
IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64
In core/internals.py, we have the BlockManager method
def equals(self, other):
    self_axes, other_axes = self.axes, other.axes
    if len(self_axes) != len(other_axes):
        return False
    if not all(ax1.equals(ax2) for ax1, ax2 in zip(self_axes, other_axes)):
        return False
    self._consolidate_inplace()
    other._consolidate_inplace()
    return all(block.equals(oblock) for block, oblock in
               zip(self.blocks, other.blocks))
and that last all assumes that the blocks in self and other correspond. But if we add some print calls before it, we see:
>>> df1.equals(df5)
blocks self: (IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64, ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object)
blocks other: (ObjectBlock: slice(0, 4, 2), 2 x 5, dtype: object, IntBlock: slice(1, 2, 1), 1 x 5, dtype: int64)
False
and so we're comparing the wrong things. I'm not sure whether this is a bug, because I don't know whether equals is meant to be this finicky. If it is, I think there's at least a doc bug, because equals should then shout that it's not meant to be used for what you might expect from the name and the docstring.
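To make the pairing problem concrete, here is a minimal pure-Python sketch. It is a toy model, not pandas code: Block, equals_positional and equals_canonical are hypothetical names invented for illustration, and each block is reduced to a dtype name, the column positions it covers, and its values. It shows why zipping blocks in storage order reports a mismatch for the two layouts printed above, and one possible way (sorting on a canonical key first) to make the comparison independent of that order.

```python
from collections import namedtuple

# Hypothetical stand-in for a pandas block: a dtype name, the column
# positions it covers, and the values stored for those columns.
Block = namedtuple("Block", ["dtype", "positions", "values"])


def equals_positional(blocks_a, blocks_b):
    """Pair blocks by storage order, like the zip() in the quoted method
    (with an extra length check)."""
    return len(blocks_a) == len(blocks_b) and all(
        a == b for a, b in zip(blocks_a, blocks_b)
    )


def equals_canonical(blocks_a, blocks_b):
    """Sort blocks by (dtype, positions) first, so storage order is irrelevant."""
    def key(block):
        return (block.dtype, block.positions)

    return equals_positional(sorted(blocks_a, key=key), sorted(blocks_b, key=key))


# The 'year' column (position 1) lives in an int64 block.
int_block = Block("int64", (1,), ((1959, 1946, 1940, 1955, 1940),))

# The 'title' and 'director' columns (positions 0 and 2) share an object block.
obj_block = Block(
    "object",
    (0, 2),
    (
        ("North by Northwest", "Notorious", "The Philadelphia Story",
         "To Catch a Thief", "His Girl Friday"),
        ("Alfred Hitchcock", "Alfred Hitchcock", "George Cukor",
         "Alfred Hitchcock", "Howard Hawks"),
    ),
)

layout_df1 = (int_block, obj_block)  # order shown for df1 above
layout_df5 = (obj_block, int_block)  # order shown for df5 above

print(equals_positional(layout_df1, layout_df5))  # False
print(equals_canonical(layout_df1, layout_df5))   # True
```

The two layouts hold exactly the same blocks, so only the positional pairing makes them look different. Whether pandas itself should canonicalize the order this way is a separate question; the sketch only illustrates the mismatch.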
Upvotes: 7