Reputation: 4744
I have a DataFrame, df1, which is a slice of df. df is multiindexed and has shape (8, 2). The slice removes some of the rows of df. When I do df1.shape it returns (4, 2), all good, but when I do df1.index.levels[0].shape this returns (4,), and the level still lists a value that no longer appears in df1. How come this happens?
In [ ]:
import numpy as np
import pandas as pd
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
df = pd.DataFrame(np.random.randn(8, 2), index=arrays)
df
Out [ ]:
                0         1
bar one -0.447155 -0.323073
    two  0.115899 -0.015561
baz one -0.272498  1.847073
    two -0.399948 -0.264327
foo one  0.169687 -1.708543
    two  1.154434  0.878936
qux one  0.535721  0.437186
    two -1.203431  0.568412
In [ ]:
df1 = df[df[1] > 0]
df1
Out [ ]:
                0         1
baz one -0.272498  1.847073
foo two  1.154434  0.878936
qux one  0.535721  0.437186
    two -1.203431  0.568412
Now for the weird bit
In [ ]:
df1=df[df[1]>0]
print(df1.index.levels[0], df1.index.levels[0].shape)
Out [ ]:
Index(['bar', 'baz', 'foo', 'qux'], dtype='object') (4,)
I find this strange as there is no bar shown in df1. What is the reason behind this?
My guess is it is something to do with copying/not copying but I don't understand why.
Upvotes: 2
Views: 806
Reputation: 879471
Consider the two indexes:
In [59]: df.index
Out[59]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])
In [58]: df1.index
Out[58]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[1, 2, 3, 3], [0, 1, 0, 1]])
When building df1 with df1 = df[df[1]>0], Pandas can build df1.index by merely changing the labels. Moreover, if it doesn't change the levels, then it doesn't have to renumber the labels. This is why df1.index contains bar even though df1 doesn't use bar.
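The point about levels and labels can be checked directly. The following is a minimal sketch, assuming a pandas version of 0.24 or later, where the labels attribute shown in the repr above is exposed as codes. The fixed values array is an assumption made so the result is reproducible; it is not the random data from the question.

```python
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
# Fixed values (an assumption, for reproducibility) instead of np.random.randn
values = np.array([[-0.4, -0.3], [0.1, -0.1], [-0.2, 1.8], [-0.3, -0.2],
                   [0.1, -1.7], [1.1, 0.8], [0.5, 0.4], [-1.2, 0.5]])
df = pd.DataFrame(values, index=arrays)
df1 = df[df[1] > 0]

# The levels are carried over from df untouched, so 'bar' is still listed
print(df1.index.levels[0].tolist())  # ['bar', 'baz', 'foo', 'qux']

# Only the integer codes (called labels in older pandas) describe the rows kept
print(df1.index.codes[0].tolist())   # [1, 2, 3, 3]
print(df1.index.codes[1].tolist())   # [0, 1, 0, 1]
```

No code equal to 0 appears in the first list, which is how bar can be absent from df1 while still being present in the levels.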
You can rebuild the index by using reset_index/set_index:
In [63]: df1.reset_index().set_index(['level_0', 'level_1']).index
Out[63]:
MultiIndex(levels=[[u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 1, 2, 2], [0, 1, 0, 1]],
names=[u'level_0', u'level_1'])
Alternatively, use Alexander's faster solution, df1.index = pd.MultiIndex.from_tuples(df1.index). Pandas doesn't do this by default, probably for better performance.
Upvotes: 2
Reputation: 394031
It's because the levels are just the set of possible values for each part of the index. It's the labels (integer positions into those levels) that determine which values are actually present in each row. For instance, in my case:
In [2]:
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
df = pd.DataFrame(np.random.randn(8,2), index=arrays)
df
Out[2]:
                0         1
bar one  1.226303  0.017598
    two  0.940893  1.491474
baz one  0.335430  1.178512
    two -1.006346 -0.733090
foo one -0.765838 -0.494056
    two -1.744994 -1.001641
qux one  0.177123 -0.969671
    two  0.544314 -0.026114
In [3]:
df1=df[df[1]>0]
df1.index
Out[3]:
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
labels=[[0, 0, 1], [0, 1, 0]])
gives:
In [4]:
df1
Out[4]:
                0         1
bar one  1.226303  0.017598
    two  0.940893  1.491474
baz one  0.335430  1.178512
So if you look at the index:
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
labels=[[0, 0, 1], [0, 1, 0]])
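the values labels=[[0, 0, 1], [0, 1, 0]] are integer positions into the levels, marking which level values are present in each row. The levels themselves are left unchanged, which is why you still see all 4 first-level values and the shape is (4,).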
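To make the relationship concrete, here is a small sketch that decodes the integer positions back into values. It assumes pandas 0.24 or later, where labels is named codes. The index built with from_product and the names sub and decoded are illustrative, not from the posts above.

```python
import pandas as pd

# Illustrative index with the same levels as in the example above
idx = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

# Keep the first three rows: (bar, one), (bar, two), (baz, one)
sub = idx[[0, 1, 2]]

# All four first-level values are still defined on the subset
print(len(sub.levels[0]))

# Looking up each code in the level recovers the values actually present
decoded = sub.levels[0].take(sub.codes[0])
print(decoded.tolist())
print(sub.get_level_values(0).tolist())
```

The last two lines print the same list, since get_level_values performs this lookup for you.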
Upvotes: 1
Reputation: 109546
Per the docs:
Note: The repr of a MultiIndex shows ALL the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. ...
This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see the actual used levels...
To reconstruct the MultiIndex with only the used levels:
df1.index = pd.MultiIndex.from_tuples(df1.index)
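As a hedged sketch of the options, the snippet below compares from_tuples with MultiIndex.remove_unused_levels, which is available in pandas 0.20 and later and is not mentioned in the posts above. The integer data and the threshold of 5 are assumptions chosen so that every bar row is filtered out.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
# Deterministic data (an assumption): column 1 holds 1, 3, 5, ..., 15
df = pd.DataFrame(np.arange(16).reshape(8, 2), index=idx)
df1 = df[df[1] > 5]  # drops both bar rows and (baz, one)

# Inspect the values actually in use without touching the index
used = df1.index.get_level_values(0).unique().tolist()

# Option 1: rebuild from tuples, as suggested above
rebuilt = pd.MultiIndex.from_tuples(df1.index)

# Option 2: drop unused level values while keeping the same rows
trimmed = df1.index.remove_unused_levels()

print(df1.index.levels[0].tolist())  # still includes 'bar'
print(used)
print(rebuilt.levels[0].tolist())
print(trimmed.levels[0].tolist())
```

Both options return a new index rather than modifying df1, so assign the result back to df1.index if you want the change to stick.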
Upvotes: 3