likethevegetable

Reputation: 304

Filter MultiIndex with Query Strings

I have a fairly large DataFrame, with say 600 index entries, and I want to use filter criteria to produce a reduced version of the DataFrame containing only the entries where the criteria are true. From the research I've done, filtering works well when you're applying expressions to the data and already know the index you're operating on. What I want to do, however, is apply the filtering criteria to the index itself. See the example below.

The MultiIndex values are in bold; the MultiIndex level names are in italics.

[Image: example DataFrame with a MultiIndex on both rows and columns]

I'd like to apply the criteria along these lines:

df = df[MultiIndex.query('base == 115 & Al.isin(stn)')]

Then maybe do something like this:

df = df.transpose()[MultiIndex.query('Fault.isin(cont)')].transpose()

To result in:

[Image: the expected filtered DataFrame]

I think fundamentally I'm trying to produce a boolean list to mask the MultiIndex. A quick way to apply a pandas query to a 2D list would also be acceptable. As of now, one option seems to be to take the MultiIndex, convert it to a DataFrame, and then apply whatever filtering I want to get the True/False array. I'm concerned that this will be slow, though.

Upvotes: 1

Views: 939

Answers (2)

filbranden

Reputation: 8908

If what you're after is using the nifty df.query() syntax to slice your data, then you're better off "unpivoting" your DataFrame, turning all index and column labels into regular fields.

You can create an "unpivot" DataFrame with:

df_unpivot = df.stack(level=[0, 1]).rename('value').reset_index()

Which will produce a DataFrame that looks like this:

  season cont  stn   base value
0 Summer Fault Alpha  115   1.0
1 Summer Fault Beta   115   0.8
2 Summer Fault Gamma  230   0.7
3 Summer Trip  Alpha  115   1.2
4 Summer Trip  Beta   115   0.9
...

Which you can then query with:

df_unpivot.query(
    'cont.str.contains("Fault") and '
    'stn.str.contains("Al") and '
    'base == 115'
)

Which produces:

  season cont  stn   base value
0 Summer Fault Alpha  115   1.0
6 Winter Fault Alpha  115   0.7

These are the two values you were expecting.
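For anyone who wants to try this end to end, here is a minimal, self-contained sketch. The sample DataFrame is an assumption: it is made up to resemble the data in the question (the level names season, cont, stn and base and a few values are taken from the output above, the remaining values are invented). The engine="python" argument is also an addition, passed so that the .str methods are evaluated by pandas itself rather than by numexpr.

```python
import pandas as pd

# Hypothetical sample data shaped like the example in the question.
rows = pd.MultiIndex.from_tuples(
    [("Summer", "Fault"), ("Summer", "Trip"),
     ("Winter", "Fault"), ("Winter", "Trip")],
    names=["season", "cont"],
)
cols = pd.MultiIndex.from_tuples(
    [("Alpha", 115), ("Beta", 115), ("Gamma", 230)],
    names=["stn", "base"],
)
df = pd.DataFrame(
    [[1.0, 0.8, 0.7],
     [1.2, 0.9, 0.6],   # 0.6 and the Winter rows below are invented values
     [0.7, 0.5, 0.4],
     [1.1, 1.0, 0.3]],
    index=rows,
    columns=cols,
)

# Move both column levels into the row index, then flatten every
# index level into an ordinary column next to a 'value' column.
df_unpivot = df.stack(level=[0, 1]).rename("value").reset_index()

# Query the former index levels like any other columns.
result = df_unpivot.query(
    'cont.str.contains("Fault") and '
    'stn.str.contains("Al") and '
    'base == 115',
    engine="python",
)
print(result)
```

Recent pandas releases may emit a FutureWarning about the stack implementation here; the filtered rows should be the same either way.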

Upvotes: 1

filbranden

Reputation: 8908

As you noticed, indexes aren't great for querying using filter expressions. There's df.filter() but it doesn't really seem to work well on a MultiIndex.

You can still filter the MultiIndex values as an iterable of Python tuples, and then use .loc to access the filtered results.

This works:

rows = [(season, cont)
        for (season, cont) in df.index
        if 'Fault' in cont]
cols = [(stn, base)
        for (stn, base) in df.columns
        if base == 115 and 'Al' in stn]
df.loc[rows, cols]
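Since the question asks for a boolean mask over the MultiIndex, a vectorized variant of the same idea might look like the sketch below. This is an alternative, not what the list comprehensions above do internally: it builds one boolean array per axis with get_level_values and passes both to .loc. The sample DataFrame is hypothetical, made up to resemble the data in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data shaped like the example in the question.
rows = pd.MultiIndex.from_tuples(
    [("Summer", "Fault"), ("Summer", "Trip"),
     ("Winter", "Fault"), ("Winter", "Trip")],
    names=["season", "cont"],
)
cols = pd.MultiIndex.from_tuples(
    [("Alpha", 115), ("Beta", 115), ("Gamma", 230)],
    names=["stn", "base"],
)
df = pd.DataFrame(
    [[1.0, 0.8, 0.7],
     [1.2, 0.9, 0.6],
     [0.7, 0.5, 0.4],
     [1.1, 1.0, 0.3]],
    index=rows,
    columns=cols,
)

# One boolean array per axis, computed from the individual index levels.
# np.asarray(..., dtype=bool) normalizes whatever boolean container
# pandas returns into a plain NumPy array.
row_mask = np.asarray(
    df.index.get_level_values("cont").str.contains("Fault"), dtype=bool
)
col_mask = np.asarray(
    df.columns.get_level_values("base") == 115, dtype=bool
) & np.asarray(
    df.columns.get_level_values("stn").str.contains("Al"), dtype=bool
)

# Boolean arrays can be passed to .loc for rows and columns at once.
filtered = df.loc[row_mask, col_mask]
print(filtered)
```

If the query syntax is still preferred for building the mask, df.index.to_frame(index=False) exposes the index levels as ordinary columns, which is the conversion the question was considering.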

Upvotes: 1
