jgg

Reputation: 831

How to slice into a MultiIndex Pandas DataFrame?

Suppose you have the following data frame:

In [1]: import pandas as pd
In [2]: index = [('California',2000),('California', 2010), ('New York', 2000),
 ('New York', 2000), ('New York', 2010), ('Texas', 2000), ('Texas',2010)]
In [3]: populations = [33871648, 37253956,189765457,19378102,20851820,25145561
     ...: ]
In [4]: pop_df = pd.DataFrame(populations,index=index,columns=["Data"])
In [5]: pop_df
Out[5]:
                         Data
(California, 2000)   33871648
(California, 2010)   37253956
(New York, 2000)    189765457
(New York, 2010)     19378102
(Texas, 2000)        20851820
(Texas, 2010)        25145561

How can I index into this DataFrame to get all of the California data? I tried pop_df[('California',)] and got a KeyError. I then converted the index to a MultiIndex and tried again, but still got a KeyError:

In [6]: index2 = pd.MultiIndex.from_tuples(index)
In [7]: pop_df2 = pop_df.reindex(index2)
In [8]: pop_df2
Out[8]:
                      Data
California 2000   33871648
           2010   37253956
New York   2000  189765457
           2010   19378102
Texas      2000   20851820
           2010   25145561

In [9]: pop_df2['California']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/opt/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3079             try:
-> 3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'California'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-141-18a1a54664b0> in <module>
----> 1 pop_df2['California']

~/opt/miniconda3/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3022             if self.columns.nlevels > 1:
   3023                 return self._getitem_multilevel(key)
-> 3024             indexer = self.columns.get_loc(key)
   3025             if is_integer(indexer):
   3026                 indexer = [indexer]

~/opt/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3080                 return self._engine.get_loc(casted_key)
   3081             except KeyError as err:
-> 3082                 raise KeyError(key) from err
   3083
   3084         if tolerance is not None:

KeyError: 'California'

What is the right way to index into a multiindex dataframe?

Upvotes: 1

Views: 4104

Answers (4)

Inputvector

Reputation: 1093

Here is one solution. Build a boolean mask from the index level you want to filter on:

pop_df2[pop_df2.index.get_level_values(0) == 'California']

#Output:
                     Data
California  2000    33871648
            2010    37253956
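
As a rough sketch of how the same idea extends to more than one level, masks built from different levels can be combined with & and |. The level names "state" and "year" are my own assumption (the question's index has unnamed levels), and the frame is rebuilt here so the snippet runs on its own:

```python
import pandas as pd

# Rebuild the question's data with a real MultiIndex.
# The level names "state" and "year" are assumptions for readability.
index = pd.MultiIndex.from_tuples(
    [("California", 2000), ("California", 2010),
     ("New York", 2000), ("New York", 2010),
     ("Texas", 2000), ("Texas", 2010)],
    names=["state", "year"],
)
populations = [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]
pop_df2 = pd.DataFrame({"Data": populations}, index=index)

# One boolean mask per index level
is_california = pop_df2.index.get_level_values("state") == "California"
is_2010 = pop_df2.index.get_level_values("year") == 2010

# Combine the masks element-wise to filter on both levels at once
california_rows = pop_df2[is_california]
california_2010 = pop_df2[is_california & is_2010]
```

The mask approach keeps the full index on the result, which may or may not be what you want.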

Upvotes: 1

BENY

Reputation: 323226

Try pd.IndexSlice:

pop_df2.loc[pd.IndexSlice[['California'],],]
Out[52]: 
                     Data
California 2000  33871648
           2010  37253956
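
A slightly more explicit spelling, as a sketch: give pd.IndexSlice one entry per index level and use : for "everything on this level". The alias idx and the variable names below are only illustrative, and the frame is rebuilt so the snippet is self-contained. Sorting the index first is a precaution, since label slices on a MultiIndex generally need a sorted index:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("California", 2000), ("California", 2010),
     ("New York", 2000), ("New York", 2010),
     ("Texas", 2000), ("Texas", 2010)]
)
populations = [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]
pop_df2 = pd.DataFrame({"Data": populations}, index=index)

idx = pd.IndexSlice              # conventional short alias
pop_sorted = pop_df2.sort_index()  # label slicing expects a sorted index

# All years for California; both index levels are kept
california = pop_sorted.loc[idx["California", :], :]

# Every state, but only the year 2010
year_2010 = pop_sorted.loc[idx[:, 2010], :]

# A label range on the first level (both endpoints included)
ca_to_ny = pop_sorted.loc[idx["California":"New York", :], :]
```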

Upvotes: 2

Quang Hoang

Reputation: 150735

df['somename'] looks up a column, while df.loc['somename'] looks up an index label. You want:

pop_df2.loc['California']

Output:

          Data
2000  33871648
2010  37253956
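
To make the difference concrete, here is a minimal sketch that rebuilds the frame and tries both forms. The variable names raised, data_column and california are just for illustration:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("California", 2000), ("California", 2010),
     ("New York", 2000), ("New York", 2010),
     ("Texas", 2000), ("Texas", 2010)]
)
populations = [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]
pop_df2 = pd.DataFrame({"Data": populations}, index=index)

# Plain [] searches the columns, and there is no "California" column
try:
    pop_df2["California"]
    raised = False
except KeyError:
    raised = True

# [] works for an actual column name
data_column = pop_df2["Data"]

# .loc searches the row index; a single top-level label selects
# all matching rows and drops that level from the result
california = pop_df2.loc["California"]
```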

You can also use xs, which lets you select on any level and optionally keep the full index hierarchy:

# `drop_level` defaults to True, which on the top level
# behaves like `.loc`; pass False to keep every level
pop_df2.xs('California', level=0, drop_level=False)

Output:

                     Data
California 2000  33871648
           2010  37253956

Or use xs on the second level:

pop_df2.xs(2010, level=1, drop_level=False)

gives you:

                     Data
California 2010  37253956
New York   2010  19378102
Texas      2010  25145561
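
If the levels are named, xs also accepts the name instead of the position. This is only a sketch: the names "state" and "year" are my assumption, since the question's index is unnamed:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("California", 2000), ("California", 2010),
     ("New York", 2000), ("New York", 2010),
     ("Texas", 2000), ("Texas", 2010)],
    names=["state", "year"],  # assumed names, not in the question
)
populations = [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]
pop_df2 = pd.DataFrame({"Data": populations}, index=index)

# Select by level name; the "year" level is dropped by default
by_year = pop_df2.xs(2010, level="year")

# Same selection, keeping both levels in the result
by_year_full = pop_df2.xs(2010, level="year", drop_level=False)

# A full key (one label per level) with .loc picks out a single value
texas_2010 = pop_df2.loc[("Texas", 2010), "Data"]
```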

Upvotes: 3

user1717828

Reputation: 7225

You want .loc[]. Without it, you are looking for a column named 'California', not an index label.

By the way, your input has a typo: the ('New York', 2000) index entry is duplicated, so the index has seven entries for only six values. Here is the full corrected code:

In [1]: import pandas as pd
   ...: index = [
   ...: ('California',2000),
   ...: ('California', 2010),
   ...: ('New York', 2000),
   ...: ('New York', 2010),
   ...: ('Texas', 2000),
   ...: ('Texas',2010)
   ...: ]
   ...: populations = [33871648, 37253956,189765457,19378102,20851820,25145561]
   ...: pop_df = pd.DataFrame(populations,index=index,columns=["Data"])
   ...: index2 = pd.MultiIndex.from_tuples(index)
   ...: pop_df2 = pop_df.reindex(index2)
   ...: pop_df2.loc['California']
Out[1]: 
          Data
2000  33871648
2010  37253956
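
As a possible simplification (a sketch, not the only way to do it), the MultiIndex can be passed to the constructor directly, which skips the reindex step. The from_product variant and the lists states and years are my own additions and assume every state has every year:

```python
import pandas as pd

index = [
    ("California", 2000), ("California", 2010),
    ("New York", 2000), ("New York", 2010),
    ("Texas", 2000), ("Texas", 2010),
]
populations = [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]

# Build the MultiIndex first, then hand it to the constructor
multi = pd.MultiIndex.from_tuples(index)
pop_df2 = pd.DataFrame({"Data": populations}, index=multi)

# Equivalent index built from the two label lists (hypothetical names);
# from_product iterates the first list slowest, matching the order above
states = ["California", "New York", "Texas"]
years = [2000, 2010]
multi_from_product = pd.MultiIndex.from_product([states, years])

california = pop_df2.loc["California"]
```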

Upvotes: 2
