Reputation: 2165
I have a multiindex dataframe in pandas, with 4 columns in the index, and some columns of data. An example is below:
import pandas as pd
import numpy as np
cnames = ['K1', 'K2', 'K3', 'K4', 'D1', 'D2']
rdata = pd.DataFrame(np.random.randint(1, 3, size=(8, len(cnames))), columns=cnames)
rdata.set_index(cnames[:4], inplace=True)
rdata.sortlevel(inplace=True)  # sortlevel is the pandas 0.13 API; newer pandas uses sort_index
print(rdata)
             D1  D2
K1 K2 K3 K4
1  1  1  1    1   2
         1    1   2
      2  1    2   1
   2  1  2    2   1
      2  1    2   1
2  1  2  2    2   1
   2  1  2    1   1
         2    1   1

[8 rows x 2 columns]
What I want to do is select the rows where there are exactly 2 values at the K3 level. Not 2 rows, but two distinct values. I've found how to generate a sort of mask for what I want:
filterFunc = lambda x: len(set(x.index.get_level_values('K3'))) == 2
mask = rdata.groupby(level=cnames[:2]).apply(filterFunc)
print(mask)
K1  K2
1   1      True
    2      True
2   1     False
    2     False
dtype: bool
And I'd hoped that, since rdata.loc[1, 2] allows you to match on just part of the index, it would be possible to do the same thing with a boolean vector like this. Unfortunately, rdata.loc[mask] fails with IndexingError: Unalignable boolean Series key provided.
This question seemed similar, but the answer given there doesn't work for anything other than the top level index, since index.get_level_values only works on a single level, not multiple ones.
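For what it's worth, the closest I've come to that by hand is aligning the group-level mask back to row level myself, by reindexing it with the (K1, K2) pair of every row. This is only a sketch of my own workaround, written against a recent pandas rather than 0.13.1:

# Build a (K1, K2) index with one entry per row of rdata, then use it
# to broadcast the group-level mask down to row level.
row_pairs = pd.MultiIndex.from_arrays(
    [rdata.index.get_level_values('K1'),
     rdata.index.get_level_values('K2')])
print(rdata[mask.reindex(row_pairs).values])

It works, but it still builds the row-level boolean vector explicitly.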
Following the suggestion here I managed to accomplish what I wanted with
rdata[[mask.loc[k1, k2] for k1, k2, k3, k4 in rdata.index]]
However, both getting the count of distinct values using len(set(index.get_level_values(...))) and building the boolean vector afterwards by iterating over every row feel like fighting the framework to achieve what seems like a simple task in a MultiIndex setup. Is there a better solution?
This is using pandas 0.13.1.
Upvotes: 1
Views: 1716
Reputation: 879093
There might be something better, but you could at least bypass defining mask by using groupby-filter:
rdata.groupby(level=cnames[:2]).filter(
    lambda grp: grp.index.get_level_values('K3').unique().size == 2)
Out[83]:
             D1  D2
K1 K2 K3 K4
1  1  1  1    1   2
         1    1   2
      2  1    2   1
   2  1  2    2   1
      2  1    2   1

[5 rows x 2 columns]
It is faster than my previous suggestions. It does really well for small DataFrames:
In [84]: %timeit rdata.groupby(level=cnames[:2]).filter(lambda grp: grp.index.get_level_values('K3').unique().size == 2)
100 loops, best of 3: 3.84 ms per loop

In [76]: %timeit rdata.groupby(level=cnames[:2]).filter(lambda grp: grp.groupby(level=['K3']).ngroups == 2)
100 loops, best of 3: 11.9 ms per loop

In [77]: %timeit rdata.groupby(level=cnames[:2]).filter(lambda grp: len(set(grp.index.get_level_values('K3'))) == 2)
100 loops, best of 3: 13.4 ms per loop
and is still the fastest for large DataFrames, though not by as much:
In [78]: rdata2 = pd.concat([rdata]*100000)

In [85]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: grp.index.get_level_values('K3').unique().size == 2)
1 loops, best of 3: 756 ms per loop

In [79]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: grp.groupby(level=['K3']).ngroups == 2)
1 loops, best of 3: 772 ms per loop

In [80]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: len(set(grp.index.get_level_values('K3'))) == 2)
1 loops, best of 3: 1 s per loop
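As an aside, on a much newer pandas than 0.13.1 (I believe 1.x and later), the per-group lambda could be avoided entirely with a transform-based variant; this is a sketch under that assumption, not something that will run on 0.13.1:

# Broadcast the per-(K1, K2) count of distinct K3 values to every row,
# then select with an index-aligned boolean mask.
k3 = pd.Series(rdata.index.get_level_values('K3'), index=rdata.index)
n_distinct = k3.groupby(level=['K1', 'K2']).transform('nunique')
result = rdata[n_distinct == 2]

The built-in 'nunique' transform should scale better than a Python callback when there are many small groups.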
Upvotes: 2