Reputation: 831
Suppose you have the following data frame:
In [1]: import pandas as pd
In [2]: index = [('California',2000),('California', 2010), ('New York', 2000),
('New York', 2000), ('New York', 2010), ('Texas', 2000), ('Texas',2010)]
In [3]: populations = [33871648, 37253956,189765457,19378102,20851820,25145561
...: ]
In [4]: pop_df = pd.DataFrame(populations,index=index,columns=["Data"])
In [5]: pop_df
Out[5]:
                         Data
(California, 2000)   33871648
(California, 2010)   37253956
(New York, 2000)    189765457
(New York, 2010)     19378102
(Texas, 2000)        20851820
(Texas, 2010)        25145561
How can one index into this dataframe to get all of the California data? I tried pop_df[('California',)] and got a KeyError. So then I executed the following and still got a KeyError:
In [6]: index2 = pd.MultiIndex.from_tuples(index)
In [7]: pop_df2 = pop_df.reindex(index2)
In [8]: pop_df2
Out[8]:
                      Data
California 2000   33871648
           2010   37253956
New York   2000  189765457
           2010   19378102
Texas      2000   20851820
           2010   25145561
In [9]: pop_df2['California']
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/opt/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3079 try:
-> 3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'California'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-141-18a1a54664b0> in <module>
----> 1 pop_df2['California']
~/opt/miniconda3/lib/python3.8/site-packages/pandas/core/frame.py in __getitem__(self, key)
3022 if self.columns.nlevels > 1:
3023 return self._getitem_multilevel(key)
-> 3024 indexer = self.columns.get_loc(key)
3025 if is_integer(indexer):
3026 indexer = [indexer]
~/opt/miniconda3/lib/python3.8/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3080 return self._engine.get_loc(casted_key)
3081 except KeyError as err:
-> 3082 raise KeyError(key) from err
3083
3084 if tolerance is not None:
KeyError: 'California'
What is the right way to index into a multiindex dataframe?
Upvotes: 1
Views: 4104
Reputation: 1093
Here is one solution. Build a boolean mask from the index level you want to match:
pop_df2[pop_df2.index.get_level_values(0) == 'California']
#Output:
                     Data
California 2000  33871648
           2010  37253956
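The mask approach also extends to other levels and to combined conditions. The following is a minimal sketch using the question's data; the level names "state" and "year" are assumptions added for readability and are not part of the original frame.

```python
import pandas as pd

# Same values as in the question. The level names "state" and "year"
# are hypothetical labels added for this sketch.
index = pd.MultiIndex.from_tuples(
    [("California", 2000), ("California", 2010),
     ("New York", 2000), ("New York", 2010),
     ("Texas", 2000), ("Texas", 2010)],
    names=["state", "year"],
)
pop = pd.DataFrame(
    {"Data": [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]},
    index=index,
)

# Boolean mask built from one index level, referenced by name.
california = pop[pop.index.get_level_values("state") == "California"]

# A mask on an index level can be combined with a mask on a column.
big_2010 = pop[
    (pop["Data"] > 20_000_000)
    & (pop.index.get_level_values("year") == 2010)
]

print(california)
print(big_2010)
```

Unlike .loc['California'], the mask keeps both index levels in the result.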
Upvotes: 1
Reputation: 323226
Try pd.IndexSlice:
pop_df2.loc[pd.IndexSlice[['California'],],]
Out[52]:
                     Data
California 2000  33871648
           2010  37253956
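IndexSlice is mostly useful when you need to slice on more than one level at once. A small sketch under the same data, with hypothetical level names "state" and "year"; the sort_index() call is there because label-range slicing on a MultiIndex expects a sorted index.

```python
import pandas as pd

# Question's data; the level names are assumptions for this sketch.
index = pd.MultiIndex.from_tuples(
    [("California", 2000), ("California", 2010),
     ("New York", 2000), ("New York", 2010),
     ("Texas", 2000), ("Texas", 2010)],
    names=["state", "year"],
)
pop = pd.DataFrame(
    {"Data": [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]},
    index=index,
).sort_index()  # range slicing on a MultiIndex expects a sorted index

idx = pd.IndexSlice

# Every state, only the year 2010.
year_2010 = pop.loc[idx[:, 2010], :]

# A label range on the first level, every year.
ca_to_ny = pop.loc[idx["California":"New York", :], :]

print(year_2010)
print(ca_to_ny)
```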
Upvotes: 2
Reputation: 150735
df['somename'] looks up a column, while df.loc['somename'] looks up an index label. You want:
pop_df2.loc['California']
Output:
          Data
2000  33871648
2010  37253956
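To make the column-versus-index distinction concrete, here is a short sketch with the question's data. The variable names are illustrative only.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("California", 2000), ("California", 2010),
     ("New York", 2000), ("New York", 2010),
     ("Texas", 2000), ("Texas", 2010)]
)
pop = pd.DataFrame(
    {"Data": [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]},
    index=index,
)

# [] looks among the columns, and "California" is not a column.
column_lookup_failed = False
try:
    pop["California"]
except KeyError:
    column_lookup_failed = True

# .loc with a partial key selects every row under that first-level label.
california = pop.loc["California"]

# .loc with a full tuple key plus a column name returns a single value.
value = pop.loc[("California", 2010), "Data"]

print(column_lookup_failed)
print(california)
print(value)
```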
You also have the xs option, which allows slicing on a different level while keeping the full index hierarchy:
# `drop_level` defaults to True, which behaves like `.loc` on the top level;
# `drop_level=False` keeps every index level in the result
pop_df2.xs('California', level=0, drop_level=False)
Output:
                     Data
California 2000  33871648
           2010  37253956
Or xs on the second level:
pop_df2.xs(2010, level=1, drop_level=False)
gives you:
                     Data
California 2010  37253956
New York   2010  19378102
Texas      2010  25145561
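The effect of drop_level is easiest to see by comparing the resulting indexes. A minimal sketch with the same data; the level names "state" and "year" are assumptions added here so that level= can be given by name.

```python
import pandas as pd

# Question's data; level names are hypothetical.
index = pd.MultiIndex.from_tuples(
    [("California", 2000), ("California", 2010),
     ("New York", 2000), ("New York", 2010),
     ("Texas", 2000), ("Texas", 2010)],
    names=["state", "year"],
)
pop = pd.DataFrame(
    {"Data": [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]},
    index=index,
)

# Default drop_level=True: the matched level is removed from the result.
ca = pop.xs("California")
by_year = pop.xs(2010, level="year")

# drop_level=False: the result keeps both index levels.
by_year_full = pop.xs(2010, level="year", drop_level=False)

print(ca.index.tolist())
print(by_year.index.tolist())
print(by_year_full.index.nlevels)
```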
Upvotes: 3
Reputation: 7225
You want .loc[]. Without it, you are looking for a column named 'California', not an index label.
By the way, your input has a typo: the index entry ('New York', 2000) appears twice. Here is the full code:
In [1]: import pandas as pd
...: index = [
...: ('California',2000),
...: ('California', 2010),
...: ('New York', 2000),
...: ('New York', 2010),
...: ('Texas', 2000),
...: ('Texas',2010)
...: ]
...: populations = [33871648, 37253956,189765457,19378102,20851820,25145561]
...: pop_df = pd.DataFrame(populations,index=index,columns=["Data"])
...: index2 = pd.MultiIndex.from_tuples(index)
...: pop_df2 = pop_df.reindex(index2)
...: pop_df2.loc['California']
Out[1]:
          Data
2000  33871648
2010  37253956
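A duplicated index entry like the one in the question can be caught before the frame is built. The sketch below is one possible sanity check, not the only one; it also passes the MultiIndex straight to the constructor, which skips the reindex step.

```python
import pandas as pd

# Index exactly as typed in the question, including the repeated entry.
index = [("California", 2000), ("California", 2010), ("New York", 2000),
         ("New York", 2000), ("New York", 2010), ("Texas", 2000),
         ("Texas", 2010)]
populations = [33871648, 37253956, 189765457, 19378102, 20851820, 25145561]

mi = pd.MultiIndex.from_tuples(index)

# is_unique is False when any label occurs more than once.
print(mi.is_unique)

# duplicated() flags every repeat after the first occurrence.
dupes = mi[mi.duplicated()]
print(list(dupes))

# The lengths also disagree: 7 labels for 6 values.
print(len(mi), len(populations))

# Drop the repeat, then build the frame directly on the MultiIndex.
clean = mi.drop_duplicates()
pop = pd.DataFrame({"Data": populations}, index=clean)
print(pop.loc["New York"])
```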
Upvotes: 2