Reputation: 311
Suppose I have the following dataframe:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.DataFrame(np.random.randn(8, 2), index=index, columns=[0, 1])
s
0 1
first second
bar one -0.012581 1.421286
two -0.048482 -0.153656
baz one -2.616540 -1.368694
two -1.989319 1.627848
foo one -0.404563 -1.099314
two -2.006166 0.867398
qux one -0.843150 -1.045291
two 2.129620 -2.697217
I know select a sub-dataframe by indexing:
temp = s.loc[('bar', slice(None)), slice(None)].copy()
temp
0 1
first second
bar one -0.012581 1.421286
two -0.048482 -0.153656
However, if I look at the index, the values of the original index still appear:
temp.index
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0], [0, 1]],
names=[u'first', u'second'])
This does not happen with normal dataframes. If you index, the remaining copy (or even the view) contains only the selected index/columns. This is annoying because I might often do lots of filtering on big dataframes and at the end I would like to know the index of what's left by just doing
df.index
df
This also happens for multiindex columns. Is there a proper way to update the index/columns and drop the empty entries?
To be clear, I want the filtered dataframe to have the same structure (multiindex index and columns). For example, I want to do:
temp = s.loc[(('bar', 'foo'), slice(None)), :]
but the index still has 'baz' and 'qux' values:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 2, 2], [0, 1, 0, 1]],
names=[u'first', u'second'])
To make clear the effect I would like to see, I wrote this snippet to eliminate redundant entries:
import pandas as pd
def update_multiindex(df):
if isinstance(df.columns, pd.MultiIndex):
new_df = {key: df.loc[:, key] for key in df.columns if not df.loc[:, key].empty}
new_df = pd.DataFrame(new_df)
else:
new_df = df.copy()
if isinstance(df.index, pd.MultiIndex):
new_df = {key: new_df.loc[key, :] for key in new_df.index if not new_df.loc[key, :].empty}
new_df = pd.DataFrame(new_df).T
return new_df
temp = update_multiindex(temp).index
temp
MultiIndex(levels=[[u'bar', u'foo'], [u'one', u'two']],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
Upvotes: 3
Views: 2069
Reputation: 4686
If I understand correctly your usage pattern you may be able to get the best of both worlds. I'm focusing on:
This is annoying because I might often do lots of filtering on big dataframes and at the end I would like to know the index of what's left by just doing
df.index df
This also happens for multiindex columns. Is there a proper way to update the index/columns and drop the empty entries?
Consideration (1) is that you want to know the index of what's left. Consideration (2) is that as mentioned above, if you trim the multiindex you can't merge any data back into your original, and also its a bunch of nonobvious steps that aren't really encouraged.
The underlying fundamental is that index does NOT return updated contents for a multiindex if any rows or columns have been deleted and this is not considered a bug because that's not the approved use of MultiIndexes (read more: github.com/pydata/pandas/issues/3686). The valid API access for the current contents of a MultiIndex is get_level_values.
So would it fit your needs to adjust your practice to use this?
df.index.get_level_values(-put your level name or number here-)
For Multiindexes this is the approved API access technique and there are some good reasons for this. If you use get_level_values instead of just .index you'll be able to get the current contents while ALSO preserving all the information in case you want to re-merge modified data or otherwise match against the original indices for comparisons, grouping, etc...
Does that fit your needs?
Upvotes: 1
Reputation: 2497
Two points. First, I think you may want to do something that is actually bad for you. I know it's annoying that you have a lot of extra cruft in your filtered indices, but if you rebuild the indices to exclude the missing categorical values, then your new indices will be incompatible with each other and the original index.
That said, I suspect (but do not know) that MultiIndex
used this way is built on top of CategoricalIndex
, which has the method remove_unused_levels(). It may be wrapped by MultiIndex
, but I cannot tell, because...
Second, MultiIndex
is notably missing from the pandas API documentation. I do not use MultiIndex
, but you might consider looking for and/or opening a ticket on GitHub about this if you do use it regularly. Beyond that, you may have to grunnel through the source code if you want to find exact information on the features available with MultiIndex
.
Upvotes: 2
Reputation: 109756
Try using droplevel
.
temp.index = temp.index.droplevel()
>>> temp
0 1
second
one 0.450819 -1.071271
two -0.371563 0.411808
>>> temp.index
Index([u'one', u'two'], dtype='object')
When dealing with columns, it's the same thing:
df.columns = df.columns.droplevel()
You can also use xs
and set the drop_level
parameter to True (default value is False):
>>> s.xs('bar', drop_level=True)
0 1
second
one 0.450819 -1.071271
two -0.371563 0.411808
Upvotes: 0
Reputation: 1171
There is a difference between the index of s
and the index of temp
:
In [25]: s.index
Out[25]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
names=[u'first', u'second'])
In [26]: temp.index
Out[26]:
MultiIndex(levels=[[u'bar', u'baz', u'foo', u'qux'], [u'one', u'two']],
labels=[[0, 0], [0, 1]],
names=[u'first', u'second'])
Notices that the labels
in the MultiIndex
are different.
Upvotes: 0