I have a fairly large DataFrame, say 600 index entries, and want to use filter criteria to produce a reduced version of the DataFrame containing only the entries where the criteria are true. From the research I've done, filtering works well when you're applying expressions to the data and already know the index you're operating on. What I want to do, however, is apply the filtering criteria to the index itself. See the example below.
MultiIndex labels are bold; MultiIndex level names are italic.
I'd like to apply the criteria as follows (or something along these lines):
df = df[MultiIndex.query('base == 115 & Al.isin(stn)')]
Then maybe do something like this:
df = df.transpose()[MultiIndex.query('Fault.isin(cont)')].transpose()
To result in:
I think fundamentally I'm trying to produce a boolean list to mask the MultiIndex. If there is a quick way to apply the pandas query to a 2D list, that would be acceptable. As of now, one option seems to be to take the MultiIndex, convert it to a DataFrame, and then apply whatever filtering I want to get the True/False array. I'm concerned that this will be slow, though.
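A rough sketch of the idea described above: convert each MultiIndex to a DataFrame of labels, evaluate an expression on it, and use the resulting True/False array as a mask. The sample data is invented for illustration, and the level names (season, cont, stn, base) are assumed. Passing engine="python" is a precaution so the .str method call is evaluated by plain Python; whether it is strictly required depends on the pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; all values are made up.
index = pd.MultiIndex.from_product(
    [["Summer", "Winter"], ["Fault", "Trip"]], names=["season", "cont"]
)
columns = pd.MultiIndex.from_tuples(
    [("Alpha", 115), ("Beta", 115), ("Gamma", 230)], names=["stn", "base"]
)
df = pd.DataFrame(
    [[1.0, 0.8, 0.7], [1.2, 0.9, 1.1], [0.7, 0.6, 0.5], [1.3, 1.0, 0.9]],
    index=index,
    columns=columns,
)

# One row of labels per index entry, with the level names as column names.
row_labels = df.index.to_frame(index=False)
col_labels = df.columns.to_frame(index=False)

# Evaluate the criteria on the labels to get boolean masks.
row_mask = row_labels.eval("cont == 'Fault'", engine="python").to_numpy()
col_mask = col_labels.eval(
    "base == 115 and stn.str.contains('Al')", engine="python"
).to_numpy()

# Apply both masks at once; no transpose is needed.
result = df.loc[row_mask, col_mask]
print(result)
```

The label frames have one row per index entry, so for a few hundred entries the conversion should be small compared with the data itself, although that has not been measured here.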
Upvotes: 1
Views: 939
If what you're after is using the nifty df.query() syntax to slice your data, then you're better off "unpivoting" your DataFrame, turning all index and column labels into regular fields.
You can create an "unpivot" DataFrame with:
df_unpivot = df.stack(level=[0, 1]).rename('value').reset_index()
Which will produce a DataFrame that looks like this:
   season   cont    stn  base  value
0  Summer  Fault  Alpha   115    1.0
1  Summer  Fault   Beta   115    0.8
2  Summer  Fault  Gamma   230    0.7
3  Summer   Trip  Alpha   115    1.2
4  Summer   Trip   Beta   115    0.9
...
Which you can then query with:
df_unpivot.query(
    'cont.str.contains("Fault") and '
    'stn.str.contains("Al") and '
    'base == 115'
)
Which produces:
   season   cont    stn  base  value
0  Summer  Fault  Alpha   115    1.0
6  Winter  Fault  Alpha   115    0.7
Which are the two values you were expecting.
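For reference, here is a minimal end-to-end sketch of the unpivot-and-query approach on a small frame. The data is invented to resemble the question's layout (it is not the original data), engine="python" is passed only as a precaution for the .str calls, and the final step that pivots the filtered rows back into a wide layout is an optional addition.

```python
import pandas as pd

# Hypothetical sample shaped like the question's table; values are made up.
index = pd.MultiIndex.from_product(
    [["Summer", "Winter"], ["Fault", "Trip"]], names=["season", "cont"]
)
columns = pd.MultiIndex.from_tuples(
    [("Alpha", 115), ("Beta", 115), ("Gamma", 230)], names=["stn", "base"]
)
df = pd.DataFrame(
    [[1.0, 0.8, 0.7], [1.2, 0.9, 1.1], [0.7, 0.6, 0.5], [1.3, 1.0, 0.9]],
    index=index,
    columns=columns,
)

# Move both column levels into the index, name the values, then flatten
# every index level into an ordinary column.
df_unpivot = df.stack(level=[0, 1]).rename("value").reset_index()

# All four labels are now regular columns, so query() can see them.
filtered = df_unpivot.query(
    'cont.str.contains("Fault") and stn.str.contains("Al") and base == 115',
    engine="python",
)

# Optional: restore the wide layout for just the rows that survived.
wide = (
    filtered.set_index(["season", "cont", "stn", "base"])["value"]
    .unstack(["stn", "base"])
)
print(filtered)
print(wide)
```

Recent pandas 2.x releases may emit a FutureWarning about the stack implementation here; the long-format result should have the same content either way.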
Upvotes: 1
As you noticed, indexes aren't great for querying using filter expressions. There's df.filter(), but it doesn't really seem to work well on a MultiIndex.
You can still filter the MultiIndex values as an iterable of Python tuples, and then use .loc to access the filtered results.
This works:
rows = [(season, cont)
        for (season, cont) in df.index
        if 'Fault' in cont]
cols = [(stn, base)
        for (stn, base) in df.columns
        if base == 115 and 'Al' in stn]
df.loc[rows, cols]
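Here is a self-contained sketch of the tuple-filtering approach on invented sample data, together with a vectorized variant that builds boolean masks from get_level_values. The sample values and the level names (season, cont, stn, base) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample shaped like the question's table; values are made up.
index = pd.MultiIndex.from_product(
    [["Summer", "Winter"], ["Fault", "Trip"]], names=["season", "cont"]
)
columns = pd.MultiIndex.from_tuples(
    [("Alpha", 115), ("Beta", 115), ("Gamma", 230)], names=["stn", "base"]
)
df = pd.DataFrame(
    [[1.0, 0.8, 0.7], [1.2, 0.9, 1.1], [0.7, 0.6, 0.5], [1.3, 1.0, 0.9]],
    index=index,
    columns=columns,
)

# Variant 1: filter the index tuples in plain Python, then select by label.
rows = [(season, cont) for (season, cont) in df.index if "Fault" in cont]
cols = [(stn, base) for (stn, base) in df.columns
        if base == 115 and "Al" in stn]
by_labels = df.loc[rows, cols]

# Variant 2: build boolean masks from individual levels, without a loop.
row_mask = df.index.get_level_values("cont").str.contains("Fault")
stn = df.columns.get_level_values("stn")
base = df.columns.get_level_values("base")
col_mask = stn.str.contains("Al") & (base == 115)
by_mask = df.loc[row_mask, col_mask]

print(by_labels)
print(by_mask)
```

Both variants should select the same cells. The mask variant is closer to the "boolean list" the question asks for, while the tuple variant allows arbitrary Python conditions.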
Upvotes: 1