Little Bobby Tables

Reputation: 4744

Why is the Index of a Pandas DataFrame Slice different to its shape?

I have a DataFrame, df1, which is a slice of df. df has a MultiIndex and 8 rows. The slice removes some rows, including every row under one of the first-level values. df1 has 4 rows, which is what I expect, but df1.index.levels[0] has shape (4,) and still lists a first-level value that no longer appears in df1. Why does this happen?

In [ ]:
    import numpy as np
    import pandas as pd

    arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
              np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]

    df = pd.DataFrame(np.random.randn(8, 2), index=arrays)
    df

Out [ ]:
                0         1
bar one -0.447155 -0.323073
    two  0.115899 -0.015561
baz one -0.272498  1.847073
    two -0.399948 -0.264327
foo one  0.169687 -1.708543
    two  1.154434  0.878936
qux one  0.535721  0.437186
    two -1.203431  0.568412

In [ ]:
    df1 = df[df[1] > 0]
    df1

Out [ ]:
                0         1
baz one -0.272498  1.847073
foo two  1.154434  0.878936
qux one  0.535721  0.437186
    two -1.203431  0.568412

Now for the weird bit

In [ ]:
    df1=df[df[1]>0]
    print(df1.index.levels[0], df1.index.levels[0].shape)

Out [ ]:
    Index(['bar', 'baz', 'foo', 'qux'], dtype='object') (4,)

I find this strange as there is no bar shown in df1. What is the reason behind this?

My guess is it is something to do with copying/not copying but I don't understand why.

Upvotes: 2

Views: 806

Answers (3)

unutbu

Reputation: 879471

Consider the two indexes:

In [59]: df.index
Out[59]: 
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]])

In [58]: df1.index
Out[58]: 
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[1, 2, 3, 3], [0, 1, 0, 1]])

When building df1 with df1 = df[df[1]>0], Pandas can build df1.index by merely selecting the matching entries of labels, which are integer positions into levels. Because the levels are left unchanged, the labels do not have to be renumbered. This is why df1.index.levels still contains bar even though no row of df1 uses bar.
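A minimal sketch of the same effect, with fixed numbers substituted for the random data (an assumption, so the same rows survive the filter on every run). Note that recent pandas versions expose the integer positions as codes; the labels name shown in the output above is the older spelling of the same thing.

```python
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]

# Fixed values (assumed, not the asker's random data); column 1 is
# positive only for baz/one, foo/two, qux/one and qux/two.
values = np.array([[-0.4, -0.3], [0.1, -0.1], [-0.2, 1.8], [-0.3, -0.2],
                   [0.1, -1.7], [1.1, 0.8], [0.5, 0.4], [-1.2, 0.5]])
df = pd.DataFrame(values, index=arrays)
df1 = df[df[1] > 0]

# levels: the distinct values defined for the first level (untouched by the slice)
levels0 = df1.index.levels[0].tolist()
# codes: one integer per remaining row, pointing into levels[0]
codes0 = df1.index.codes[0].tolist()

print(levels0)  # ['bar', 'baz', 'foo', 'qux']
print(codes0)   # [1, 2, 3, 3]
```

No code equals 0, so nothing points at bar any more, yet bar stays in levels[0].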

You can rebuild the index by using reset_index/set_index:

In [63]: df1.reset_index().set_index(['level_0', 'level_1']).index
Out[63]: 
MultiIndex(levels=[[u'baz', u'foo', u'qux'], [u'one', u'two']],
           labels=[[0, 1, 2, 2], [0, 1, 0, 1]],
           names=[u'level_0', u'level_1'])

Alternatively, use Alexander's faster solution, df1.index = pd.MultiIndex.from_tuples(df1.index). Pandas doesn't do this by default, probably for better performance.

Upvotes: 2

EdChum

Reputation: 394031

It's because levels only lists the distinct values defined for each level of the index. It is the labels that determine which of those values are actually present in the rows. For instance, in my case:

In [2]:
arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]

df = pd.DataFrame(np.random.randn(8,2), index=arrays)
df

Out[2]:
                0         1
bar one  1.226303  0.017598
    two  0.940893  1.491474
baz one  0.335430  1.178512
    two -1.006346 -0.733090
foo one -0.765838 -0.494056
    two -1.744994 -1.001641
qux one  0.177123 -0.969671
    two  0.544314 -0.026114

In [3]:    
df1=df[df[1]>0]
df1.index

Out[3]:
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1], [0, 1, 0]])

gives:

In [4]:
df1

Out[4]:
                0         1
bar one  1.226303  0.017598
    two  0.940893  1.491474
baz one  0.335430  1.178512

So if you look at the index:

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
               labels=[[0, 0, 1], [0, 1, 0]])

the values in labels=[[0, 0, 1], [0, 1, 0]] are positions into levels that identify which level values are present in each row. levels itself is left untouched by the slice, which is why levels[0] still shows all 4 values and has shape (4,).
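To see which first-level values are really in use, as opposed to merely defined, one option is get_level_values. This is only a sketch: the frame below uses made-up fixed numbers rather than the random data above, chosen so that bar is filtered out.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
# Illustrative values (assumed): column 1 is positive for
# baz/one, foo/two, qux/one and qux/two only.
df = pd.DataFrame({0: list(range(8)), 1: [-1, -1, 1, -1, -1, 1, 1, 1]}, index=index)
df1 = df[df[1] > 0]

# Everything the index knows about, used or not
defined = df1.index.levels[0].tolist()
# Only the values that appear in at least one remaining row
used = df1.index.get_level_values(0).unique().tolist()

print(defined)  # ['bar', 'baz', 'foo', 'qux']
print(used)     # ['baz', 'foo', 'qux']
```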

Upvotes: 1

Alexander

Reputation: 109546

Per the docs:

Note: The repr of a MultiIndex shows ALL the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. ...

This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see the actual used levels...

To reconstruct the MultiIndex with only the used levels:

df1.index = pd.MultiIndex.from_tuples(df1.index)
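As a hedged alternative, assuming a pandas version that provides MultiIndex.remove_unused_levels, the unused levels can be dropped without rebuilding the index from tuples. The sketch below uses illustrative fixed data (not the asker's random values) and compares both approaches.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
# Illustrative values (assumed): the filter keeps
# baz/one, foo/two, qux/one and qux/two.
df = pd.DataFrame({0: list(range(8)), 1: [-1, -1, 1, -1, -1, 1, 1, 1]}, index=index)
df1 = df[df[1] > 0]

# Option 1: drop levels that no remaining row refers to
trimmed = df1.index.remove_unused_levels()
# Option 2: rebuild the index from its tuples, as in the line above
rebuilt = pd.MultiIndex.from_tuples(df1.index)

trimmed_levels = trimmed.levels[0].tolist()
trimmed_codes = trimmed.codes[0].tolist()
rebuilt_levels = rebuilt.levels[0].tolist()
# The row labels are unchanged; only the internal bookkeeping differs
same_rows = trimmed.equals(df1.index)

df1.index = trimmed
print(trimmed_levels)  # ['baz', 'foo', 'qux']
print(trimmed_codes)   # [0, 1, 2, 2]
```

Either way the rows of df1 are labelled exactly as before, so this only matters when code inspects levels directly.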

Upvotes: 3
