capitalistcuttle
capitalistcuttle

Reputation: 1823

How can I select combinations of levels in a pandas Multindex?

I have the following dataframe:

import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[1, 2], ['a', 'b', 'c'], ['a', 'b', 'c']],
                                   names=['one', 'two', 'three'])

df = pd.DataFrame(np.random.rand(18, 3), index=index)

            0           1           2
one two three           
1   a   b   0.002568    0.390393    0.040717
        c   0.943853    0.105594    0.738587
    b   b   0.049197    0.500431    0.001677
        c   0.615704    0.051979    0.191894
2   a   b   0.748473    0.479230    0.042476
        c   0.691627    0.898222    0.252423
    b   b   0.270330    0.909611    0.085801
        c   0.913392    0.519698    0.451158

I want to select rows where combination of index levels two and three are (a, b) or (b, c). How can I do this?

I tried df.loc[(slice(None), ['a', 'b'], ['b', 'c']), :] but that gives me all combinations of [a, b] and [b, c], including (a, c) and (b, b), which aren't needed.

I tried df.loc[pd.MultiIndex.from_tuples([(None, 'a', 'b'), (None, 'b', 'c')])] but that returns NaN in level one of the index.

df.loc[pd.MultiIndex.from_tuples([(None, 'a', 'b'), (None, 'b', 'c')])]

            0   1   2
NaN a   b   NaN NaN NaN
    b   c   NaN NaN NaN

So I thought I needed a slice at level one, but that gives me a TypeError:

pd.MultiIndex.from_tuples([(slice(None), 'a', 'b'), (slice(None), 'b', 'c')])

TypeError: unhashable type: 'slice'

I feel like I'm missing some simple one-liner here :).

Upvotes: 2

Views: 171

Answers (2)

Zero
Zero

Reputation: 76997

Here's one approach with loc and get_level_values

In [3231]: idx = df.index.get_level_values

In [3232]: df.loc[((idx('two') == 'a') & (idx('three') == 'b')) |
                  ((idx('two') == 'b') & (idx('three') == 'c'))]
Out[3232]:
                      0         1         2
one two three
1   a   b      0.442332  0.380669  0.832598
    b   c      0.458145  0.017310  0.068655
2   a   b      0.933427  0.148962  0.569479
    b   c      0.727993  0.172090  0.384461

Generic way

In [3262]: conds = [('a', 'b'), ('b', 'c')]

In [3263]: mask = np.column_stack(
                      [(idx('two') == c[0]) & (idx('three') == c[1]) for c in conds]
                    ).any(1)

In [3264]: df.loc[mask]
Out[3264]:
                      0         1         2
one two three
1   a   b      0.442332  0.380669  0.832598
    b   c      0.458145  0.017310  0.068655
2   a   b      0.933427  0.148962  0.569479
    b   c      0.727993  0.172090  0.384461

Upvotes: 1

MaxU - stand with Ukraine
MaxU - stand with Ukraine

Reputation: 210932

Use df.query():

In [174]: df.query("(two=='a' and three=='b') or (two=='b' and three=='c')")
Out[174]:
                      0         1         2
one two three
1   a   b      0.211555  0.193317  0.623895
    b   c      0.685047  0.369135  0.899151
2   a   b      0.082099  0.555929  0.524365
    b   c      0.901859  0.068025  0.742212

UPDATE: we can also generate such "query" dynamically:

In [185]: l = [('a','b'), ('b','c')]

In [186]: q = ' or '.join(["(two=='{}' and three=='{}')".format(x,y) for x,y in l])

In [187]: q
Out[187]: "(two=='a' and three=='b') or (two=='b' and three=='c')"

In [188]: df.query(q)
Out[188]:
                      0         1         2
one two three
1   a   b      0.211555  0.193317  0.623895
    b   c      0.685047  0.369135  0.899151
2   a   b      0.082099  0.555929  0.524365
    b   c      0.901859  0.068025  0.742212

Upvotes: 2

Related Questions