
Reputation: 11734

pandas: Boolean indexing with multi index

There are many questions here with similar titles, but I couldn't find one that addresses this issue.

I have dataframes from many different origins, and I want to filter one by the other. Boolean indexing works great when the boolean series is the same size as the filtered dataframe, but not when the size of the series matches only a higher level of the filtered dataframe's index.

In short, let's say I have this dataframe:

In [4]: df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 
                           'b':[1,2,3,1,2,3,1,2,3], 
                           'c':range(9)}).set_index(['a', 'b'])
Out[4]: 
     c
a b   
1 1  0
  2  1
  3  2
2 1  3
  2  4
  3  5
3 1  6
  2  7
  3  8

And this series:

In [5]: filt = pd.Series({1:True, 2:False, 3:True})
Out[5]: 
1     True
2    False
3     True
dtype: bool

And the output I want is this:

     c
a b   
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8

I am not looking for solutions that do not use the filt series, such as:

df[df.index.get_level_values('a') != 2]
df[df.index.get_level_values('a').isin([1,3])]

I want to know if I can use my input filt series as is, as I would use a filter on c:

filt = df.c < 7
df[filt]

Upvotes: 15

Views: 10828

Answers (8)

Attila the Fun

Reputation: 417

Building on @Markus Dutschke's answer, note that the IndexSlice object can be created just once and then used over and over (even to slice up different objects). I find this creates more readable code, especially when using it twice to slice on both MultiIndex rows and columns in the same .loc.

Applying this to that answer and simplifying slightly (no need for .values):

idx = pd.IndexSlice
df.loc[idx[filt[filt].index, :], :]

or the full code:

import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c':range(9)}).set_index(['a', 'b'])
filt = pd.Series({1:True, 2:False, 3:True})
idx = pd.IndexSlice

print(df.loc[idx[[1, 3], :]])
print(df.loc[(df.index.levels[0].values[filt], slice(None)), :])
print(df.loc[idx[filt[filt].index, :], :])
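To illustrate the rows-and-columns case mentioned above, here is a hypothetical sketch (the frame, level names, and labels below are made up, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a MultiIndex on both axes
rows = pd.MultiIndex.from_product([[1, 2, 3], ['a', 'b']], names=['r1', 'r2'])
cols = pd.MultiIndex.from_product([['x', 'y'], ['p', 'q']], names=['c1', 'c2'])
df2 = pd.DataFrame(np.arange(24).reshape(6, 4), index=rows, columns=cols)

idx = pd.IndexSlice
# the same idx object slices rows (first level in [1, 3]) and
# columns (everything under 'x') in a single .loc call
sub = df2.loc[idx[[1, 3], :], idx['x', :]]
print(sub)
```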

Upvotes: 0

Literal

Reputation: 827

Not sure how fast/slow it would be on a large-scale dataframe, but what I sometimes do is

df.loc[filt[filt].index]

The catch is that the loc method accepts a boolean input only when it is aligned with the full index. If you instead provide the first-level label values you want to retain, you're good to go. So by filtering filt with itself (it lives on a plain 1D index) and keeping the values from its index, you achieve your goal.
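To make the mechanics concrete, a quick sketch using the question's df and filt:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'b': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'c': range(9)}).set_index(['a', 'b'])
filt = pd.Series({1: True, 2: False, 3: True})

# filt[filt] keeps only the True entries, so its index holds the
# level-'a' labels to retain: [1, 3]
keep = filt[filt].index

# passing those labels to .loc selects whole first-level groups
result = df.loc[keep]
print(result)
```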

Upvotes: 0

Yaroslav

Reputation: 1271

A more readable (to my liking) solution is to reindex the boolean series to match the index of the multi-indexed df:

df.loc[filt.reindex(df.index, level='a')]
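For illustration, here is that one-liner spelled out on the question's data (assuming the level name 'a' as in the question):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'b': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'c': range(9)}).set_index(['a', 'b'])
filt = pd.Series({1: True, 2: False, 3: True})

# reindex broadcasts filt across level 'a', yielding one boolean per row
mask = filt.reindex(df.index, level='a')

out = df.loc[mask]
print(out)
```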

Upvotes: 2

Markus Dutschke

Reputation: 10606

You can use pd.IndexSlice.

>>> df.loc[pd.IndexSlice[filt[filt].index.values, :], :]
     c
a b   
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8

where filt[filt].index.values is just [1, 3]. In other words

>>> df.loc[pd.IndexSlice[[1, 3], :]]
     c
a b   
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8

so if you design your filter construction a bit differently, the expression gets shorter. The advantage over Emanuele Paolini's solution df[filt[df.index.get_level_values('a')].values] is that you have more control over the indexing.

The topic of multiindex slicing is covered in more depth here.

Here is the full code:

import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c':range(9)}).set_index(['a', 'b'])
filt = pd.Series({1:True, 2:False, 3:True})

print(df.loc[pd.IndexSlice[[1, 3], :]])
print(df.loc[(df.index.levels[0].values[filt], slice(None)), :])
print(df.loc[pd.IndexSlice[filt[filt].index.values, :], :])

Upvotes: 3

EliadL

Reputation: 7068

Simply:

df.where(
    filt.rename_axis('a').rename('c').to_frame()
).dropna().astype(int)

Explanation:

  • .rename_axis('a') renames the index as a (the index we want to filter by)
  • .rename('c') renames the column as c (the column that stores the values)
  • .to_frame() converts this Series into a DataFrame, for compatibility with df
  • df.where(...) filters the rows, leaving missing values (NaN) where filter is False
  • .dropna() removes the rows with missing values (in our case where a == 2)
  • .astype(int) converts from float back to int (the NaNs introduced by where force the column to float)

By the way, it seems that df.where(...) and df[...] behave similarly here, so take your pick.

Upvotes: 0

Dennis Golomazov

Reputation: 17319

I was facing exactly the same problem. I found this question and tried the solutions here, but none of them was efficient enough. My dataframes are: A = 700k rows x 14 cols, B = 100M rows x 3 cols. B has a MultiIndex whose first (high) level is equal to the index of A. Let C be a slice of A of size 10k rows. My task was to get the rows of B whose high-level index matches the index of C, as fast as possible. C is selected at runtime; A and B are static.

I tried the solutions from here: get_level_values takes many seconds, and df.align didn't even finish, raising a MemoryError (after also running for seconds).

The solution that worked for me (~300 ms at runtime) is the following:

  1. For each index value i from A, find the first and the last (non-inclusive) positional indexes in B which contain i as the first level of MultiIndex. Store these pairs in A. This is done once and in advance. Example code:

    from collections import defaultdict
    import pandas as pd

    def construct_position_indexes(A, B):
        # map each first-level index value to its [start, end) positions in B
        indexes = defaultdict(list)
        prev_index = None
        for i, cur_index in enumerate(B.index.get_level_values(0)):
            if cur_index != prev_index:
                indexes[cur_index].append(i)
                if prev_index is not None:
                    indexes[prev_index].append(i)
            prev_index = cur_index
        # close the last open range
        indexes[cur_index].append(i + 1)
        index_df = pd.DataFrame(list(indexes.values()),
                                index=list(indexes.keys()),
                                columns=['start_index', 'end_index'], dtype=int)
        A = A.join(index_df)
        # the join turns them into floats, so we fix that
        A['start_index'] = A.start_index.fillna(0).astype(int)
        A['end_index'] = A.end_index.fillna(0).astype(int)
        return A
    
  2. At runtime, get positional boundaries from C and construct a list of all positional indexes to search for in B, and pass them to B.take():

    def get_slice(B, C):
        all_indexes = []
        for start_index, end_index in zip(
                C.start_index.values, C.end_index.values):
            all_indexes.extend(range(start_index, end_index))
        return B.take(all_indexes)
    

I hope it's not too complicated. Essentially, the idea is for each row in A store the range of corresponding (positional) indexes of rows in B, so that at runtime we can quickly construct the list of all positional indexes to query B by.

This is a toy example:

A = pd.DataFrame(range(3), columns=['dataA'], index=['A0', 'A1', 'A2'])
print(A)

    dataA
A0      0
A1      1
A2      2

mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'), 
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame(range(6), columns=['dataB'], index=mindex)
print(B)

       dataB
A0 B0      0
   B1      1
A1 B0      2
A2 B0      3
   B1      4
   B3      5

A = construct_position_indexes(A, B)
print(A)

    dataA  start_index  end_index
A0      0            0          2
A1      1            2          3
A2      2            3          6

C = A.iloc[[0, 2], :]
print(C)

    dataA  start_index  end_index
A0      0            0          2
A2      2            3          6

print(get_slice(B, C))

       dataB
A0 B0      0
   B1      1
A2 B0      3
   B1      4
   B3      5

Upvotes: 0

Emanuele Paolini

Reputation: 10162

If you transform your index 'a' back to a column, you can do it as follows:

>>> df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 
                       'b':[1,2,3,1,2,3,1,2,3], 
                       'c':range(9)})
>>> filt = pd.Series({1:True, 2:False, 3:True})
>>> df[filt[df['a']].values]
   a  b  c
0  1  1  0
1  1  2  1
2  1  3  2
6  3  1  6
7  3  2  7
8  3  3  8

Edit: as suggested by @joris, this also works with indexes. Here is the code for your sample data:

>>> df[filt[df.index.get_level_values('a')].values]
     c
a b   
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8

Upvotes: 8

joris

Reputation: 139142

If the boolean series is not aligned with the dataframe you want to index it with, you can first explicitly align it with align:

In [25]: df_aligned, filt_aligned = df.align(filt.to_frame(), level=0, axis=0)

In [26]: filt_aligned
Out[26]:
         0
a b
1 1   True
  2   True
  3   True
2 1  False
  2  False
  3  False
3 1   True
  2   True
  3   True

And then you can index with it:

In [27]: df[filt_aligned[0]]
Out[27]:
     c
a b
1 1  0
  2  1
  3  2
3 1  6
  2  7
  3  8

Note: align doesn't work with a Series, hence the to_frame in the align call, and hence the [0] above to get the series back.

Upvotes: 3
