Reputation: 440
Hi here is an example of behaviour I don't understand. Here is an example of multiindex in columns
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(3, 8), index=['A', 'B', 'C'], columns=index)
Now i'd like to select a subset of df by the first level and return the relevant columns:
df.loc[:, ['bar']].columns
returns
MultiIndex(levels=[['bar'], ['one', 'two']],
labels=[[0, 0], [0, 1]],
names=['first', 'second'])
but
df.loc[:, ['bar', 'baz']].columns
returns
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=['first', 'second'])
Any reason why the second one would return ALL the column names instead of
MultiIndex(levels=[['bar', 'baz'], ['one', 'two']] etc...
And more importantly any quick fix so I can return only the relevant data?
This is made even more relevant as pandas are deprecating Panels (which used to be a very elegant way to store multidimensional data)
Upvotes: 2
Views: 134
Reputation: 862406
In new pandas version (0.20.1
) use MultiIndex.remove_unused_levels
:
print (df.loc[:, ['bar', 'baz']].columns)
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=['first', 'second'])
print (df.loc[:, ['bar', 'baz']].columns.remove_unused_levels())
MultiIndex(levels=[['bar', 'baz'], ['one', 'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
names=['first', 'second'])
Advanced shown levels in documentation.
Upvotes: 2