Pandas Column Names of MultiIndex DataFrame - strange behaviour

Question

I observed some strange pandas behavior with the columns attribute of a MultiIndex DataFrame.

Constructing a MultiIndex DataFrame:

import pandas as pd

a = [0, .25, .5, .75]
b = [1, 2, 3, 4]
c = [5, 6, 7, 8]
d = [1, 2, 3, 5]
df = pd.DataFrame(data={('a', 'a'): a, ('b', 'b'): b, ('c', 'c'): c, ('d', 'd'): d})

produces this DataFrame:

      a  b  c  d
      a  b  c  d
0  0.00  1  5  1
1  0.25  2  6  2
2  0.50  3  7  3
3  0.75  4  8  5

Creating a new variable with a subset of the original dataFrame

df1=df.copy().loc[:,[('a', 'a'), ('b', 'b')]]

produces the expected subset, but accessing the column names of this new DataFrame produces some unexpected output:

print df1.columns

MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [u'a', u'b', u'c', u'd']],
           labels=[[0, 1], [0, 1]])

so 'c' and 'd' are still contained in the levels, even though the columns ('c', 'c') and ('d', 'd') are no longer part of df1.

In contrast

print df1.columns.tolist()

returns the expected result:

[('a', 'a'), ('b', 'b')]

Can anybody explain the reason for this behavior?

jezrael · Accepted Answer

I think you need MultiIndex.remove_unused_levels, which is a new function in version 0.20.0.

print (df1.columns)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['a', 'b', 'c', 'd']],
           labels=[[0, 1], [0, 1]])

print (df1.columns.remove_unused_levels())
MultiIndex(levels=[['a', 'b'], ['a', 'b']],
           labels=[[0, 1], [0, 1]])
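
For reference, here is a minimal, self-contained sketch of the whole round trip, rebuilding the DataFrame from the question. The names levels_before and levels_after are only illustrative. As far as I know, pandas keeps the original levels after a selection on purpose, so that slicing does not have to recompute them. Note that remove_unused_levels returns a new MultiIndex instead of changing the existing one, so the result has to be assigned back.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    ('a', 'a'): [0, .25, .5, .75],
    ('b', 'b'): [1, 2, 3, 4],
    ('c', 'c'): [5, 6, 7, 8],
    ('d', 'd'): [1, 2, 3, 5],
})

# Select two of the four columns.
df1 = df.loc[:, [('a', 'a'), ('b', 'b')]].copy()

# The levels are the lookup tables behind the index; a selection
# typically leaves them as they were in the original DataFrame.
levels_before = [list(level) for level in df1.columns.levels]

# remove_unused_levels() returns a new MultiIndex, so assign it back.
df1.columns = df1.columns.remove_unused_levels()
levels_after = [list(level) for level in df1.columns.levels]

print(levels_before)
print(levels_after)
print(df1.columns.tolist())
```

The column tuples themselves are the same before and after; only the stored levels shrink.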
