Cheng
Cheng

Reputation: 17894

Pandas how to perform conditional selection on MultiIndex

Here is the sample data file, and I performed the following operation in ipython notebook:

!curl -O http://pbpython.com/extras/sales-funnel.xlsx

df = pd.read_excel('./sales-funnel.xlsx')
df['Status'] = df['Status'].astype('category')
df["Status"].cat.set_categories(["won","pending","presented","declined"],inplace=True)

table = pd.pivot_table(df,
               index=['Manager', 'Status'],
               values=['Price', 'Quantity'],
               columns=['Product'],
               aggfunc={'Price':[np.sum, np.mean], 'Quantity':len},
               fill_value=0
              )

This is what the data looks like in table:

enter image description here

I want to select (Manager=="Debra Henley") & (Status=="won") and it works with the query method:

table.query('(Manager=="Debra Henley") & (Status=="won")')

But how do you perform the same selection with loc? I tried this but does not work:

table.loc[['Debra Henley', 'won']]

What do you guys usually use when dealing with MultiIndex? What's the best way to do it?


Update: found two solutions so far:

table.xs(('Debra Henley','won'), level=('Manager', 'Status'))
table.loc[[('Debra Henley', 'won')]]

So I guess tuples should be used instead of lists when indexing with MultiIndex?

Upvotes: 6

Views: 5898

Answers (3)

jezrael
jezrael

Reputation: 862481

For simplier selections (only index or only columns) use xs approach or selecting by tuples.

Another more general solution with slicers:

idx = pd.IndexSlice
#output is df
print (table.loc[[idx['Debra Henley','won']]])
                    Quantity                               Price              \
                         len                                mean               
Product                  CPU Maintenance Monitor Software    CPU Maintenance   
Manager      Status                                                            
Debra Henley won           1           0       0        0  65000           0   


                                        sum                               
Product             Monitor Software    CPU Maintenance Monitor Software  
Manager      Status                                                       
Debra Henley won          0        0  65000           0       0        0

idx = pd.IndexSlice
#output is series
print (table.loc[idx['Debra Henley','won'],:])
Quantity  len   CPU                1
                Maintenance        0
                Monitor            0
                Software           0
Price     mean  CPU            65000
                Maintenance        0
                Monitor            0
                Software           0
          sum   CPU            65000
                Maintenance        0
                Monitor            0
                Software           0
Name: (Debra Henley, won), dtype: int64

But it is better for more complicated selections - if need filter index and columns together - one xs doesnt work:

idx = pd.IndexSlice
#select all rows where first level is Debra Henley in index and 
#in columns second level is len and sum
print (table.loc[idx['Debra Henley',:], idx[:, ['len', 'sum'], :]])
                       Quantity                               Price  \
                            len                                 sum   
Product                     CPU Maintenance Monitor Software    CPU   
Manager      Status                                                   
Debra Henley won              1           0       0        0  65000   
             pending          1           2       0        0  40000   
             presented        1           0       0        2  30000   
             declined         2           0       0        0  70000   



Product                Maintenance Monitor Software  
Manager      Status                                  
Debra Henley won                 0       0        0  
             pending         10000       0        0  
             presented           0       0    20000  
             declined            0       0        0     

Upvotes: 3

piRSquared
piRSquared

Reputation: 294218

Your canonical answer is provided by @ScottBoston.

I'll add this for breadth and perspective in addition to @jezrael's IndexSlice approach.
You can also use pd.DataFrame.xs to take a cross-section

table.xs(['Debra Henley', 'won'])

                Product    
Quantity  len   CPU                1
                Maintenance        0
                Monitor            0
                Software           0
Price     mean  CPU            65000
                Maintenance        0
                Monitor            0
                Software           0
          sum   CPU            65000
                Maintenance        0
                Monitor            0
                Software           0
Name: (Debra Henley, won), dtype: int64

Upvotes: 6

Scott Boston
Scott Boston

Reputation: 153460

Yes, you can use:

table.loc[[('Debra Henley', 'won')]]

to return a pandas data frame or you can use:

table.loc[('Debra Henley','won')]

to return a pandas series.

You can refer to the this documentation.

Upvotes: 1

Related Questions