Reputation: 11734
There are many questions here with similar titles, but I couldn't find one that's addressing this issue.
I have dataframes from many different origins, and I want to filter one by the other. Using boolean indexing works great when the boolean series is the same size as the filtered dataframe, but not when the size of the series is the same as a higher level index of the filtered dataframe.
In short, let's say I have this dataframe:
In [4]: df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3],
'b':[1,2,3,1,2,3,1,2,3],
'c':range(9)}).set_index(['a', 'b'])
Out[4]:
c
a b
1 1 0
2 1
3 2
2 1 3
2 4
3 5
3 1 6
2 7
3 8
And this series:
In [5]: filt = pd.Series({1:True, 2:False, 3:True})
Out[6]:
1 True
2 False
3 True
dtype: bool
And the output I want is this:
c
a b
1 1 0
2 1
3 2
3 1 6
2 7
3 8
I am not looking for solutions that don't use the filt series, such as:
df[df.index.get_level_values('a') != 2]
df[df.index.get_level_values('a').isin([1,3])]
I want to know if I can use my input filt series as is, the way I would use a filter on c:
filt = df.c < 7
df[filt]
Upvotes: 15
Views: 10828
Reputation: 417
Building on @Markus Dutschke's answer, note that the IndexSlice object can be created just once and then used over and over (even to slice up different objects). I find this creates more readable code, especially when using it twice to slice on both MultiIndex rows and columns in the same .loc call.
Applying this to his answer and simplifying slightly (no need for .values):
idx = pd.IndexSlice
df.loc[idx[filt[filt].index, :], :]
or the full code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c':range(9)}).set_index(['a', 'b'])
filt = pd.Series({1:True, 2:False, 3:True})
idx = pd.IndexSlice
print(df.loc[idx[[1, 3], :]])
print(df.loc[(df.index.levels[0].values[filt], slice(None)), :])
print(df.loc[idx[filt[filt].index, :], :])
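To illustrate the "used twice in the same .loc" point, here is a minimal sketch with a hypothetical frame that has a MultiIndex on both axes (the frame and its labels are made up for demonstration):

```python
import pandas as pd

# Hypothetical frame with a MultiIndex on both rows and columns
rows = pd.MultiIndex.from_product([[1, 2, 3], ['x', 'y']], names=['a', 'b'])
cols = pd.MultiIndex.from_product([['p', 'q'], ['u', 'v']])
wide = pd.DataFrame(0, index=rows, columns=cols)

idx = pd.IndexSlice
# One idx object, reused for both axes in a single .loc call:
# rows where level 'a' is 1 or 3, columns under top-level 'p'
sub = wide.loc[idx[[1, 3], :], idx['p', :]]
```

Here `sub` keeps the four rows with `a` in {1, 3} and the two columns under `'p'`.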
Upvotes: 0
Reputation: 827
Not sure how fast/slow it would be on a large-scale dataframe, but what I sometimes do is
df.loc[filt[filt].index]
The problem is that the loc method only accepts boolean input against a 1D index. If you instead provide the first-level values you want to retain, you're good to go. So by filtering filt with itself (which does sit on a 1D index) and keeping the values from its index, you achieve exactly that.
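To make this concrete, a self-contained sketch recreating the question's df and filt (the intermediate name kept is mine, for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'b': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'c': range(9)}).set_index(['a', 'b'])
filt = pd.Series({1: True, 2: False, 3: True})

# filt[filt] keeps only the True entries, so its index holds
# exactly the level-'a' labels to retain: [1, 3]
kept = filt[filt].index
result = df.loc[kept]
```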
Upvotes: 0
Reputation: 1271
A more readable solution (to my taste) is to reindex the boolean series so that it matches the index of the multi-index df:
df.loc[filt.reindex(df.index, level='a')]
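A runnable sketch of this, recreating the question's df and filt (the mask name is mine, for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'b': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'c': range(9)}).set_index(['a', 'b'])
filt = pd.Series({1: True, 2: False, 3: True})

# reindex with level='a' broadcasts filt along that level,
# producing one boolean per (a, b) row of df
mask = filt.reindex(df.index, level='a')
result = df.loc[mask]
```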
Upvotes: 2
Reputation: 10606
You can use pd.IndexSlice.
>>> df.loc[pd.IndexSlice[filt[filt].index.values, :], :]
c
a b
1 1 0
2 1
3 2
3 1 6
2 7
3 8
where filt[filt].index.values is just [1, 3]. In other words
>>> df.loc[pd.IndexSlice[[1, 3], :]]
c
a b
1 1 0
2 1
3 2
3 1 6
2 7
3 8
so if you design your filter construction a bit differently, the expression gets shorter. The advantage over Emanuele Paolini's solution df[filt[df.index.get_level_values('a')].values] is that you have more control over the indexing.
The topic of multiindex slicing is covered in more depth here.
Here is the full code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3], 'b':[1,2,3,1,2,3,1,2,3], 'c':range(9)}).set_index(['a', 'b'])
filt = pd.Series({1:True, 2:False, 3:True})
print(df.loc[pd.IndexSlice[[1, 3], :]])
print(df.loc[(df.index.levels[0].values[filt], slice(None)), :])
print(df.loc[pd.IndexSlice[filt[filt].index.values, :], :])
Upvotes: 3
Reputation: 7068
Simply:
df.where(
    filt.rename_axis('a').rename('c').to_frame()
).dropna().astype(int)
Explanation:
- .rename_axis('a') renames the index to a (the index we want to filter by)
- .rename('c') renames the Series to c (the column that stores the values)
- .to_frame() converts this Series into a DataFrame, for compatibility with df
- df.where(...) filters the rows, leaving missing values (NaN) where the filter is False
- .dropna() removes the rows with missing values (in our case, where a == 2)
- .astype(int) converts from float back to int (the NaN values introduced by where force the column to float)

By the way, it seems that df.where(...) and df[...] behave similarly here, so take your pick.
Upvotes: 0
Reputation: 17319
I was facing exactly the same problem. I found this question and tried the solutions here, but none of them was efficient enough. My dataframes are A = 700k rows x 14 cols and B = 100M rows x 3 cols. B has a MultiIndex whose first (high) level is equal to the index of A. Let C be a slice from A of size 10k rows. My task was to get the rows from B whose high-level index matches the indexes of C, as fast as possible. C is selected at runtime; A and B are static.
I tried the solutions from here: get_level_values takes many seconds, and df.align didn't even finish, raising a MemoryError (after also taking seconds).
The solution that worked for me (~300 ms at runtime) is the following: for each index value i in A, find the first and the last (non-inclusive) positional indexes in B that have i as the first level of their MultiIndex, and store these pairs in A. This is done once, in advance.
Example code:
from collections import defaultdict

import pandas as pd

def construct_position_indexes(A, B):
    # map each first-level index value to its [start, end) positions in B
    indexes = defaultdict(list)
    prev_index = 0
    for i, cur_index in enumerate(B.index.get_level_values(0)):
        if cur_index != prev_index:
            indexes[cur_index].append(i)
            if prev_index:
                indexes[prev_index].append(i)
            prev_index = cur_index
    indexes[cur_index].append(i + 1)
    index_df = pd.DataFrame(list(indexes.values()),
                            index=list(indexes.keys()),
                            columns=['start_index', 'end_index'], dtype=int)
    A = A.join(index_df)
    # the join turns the columns into floats, so we fix that
    A['start_index'] = A.start_index.fillna(0).astype(int)
    A['end_index'] = A.end_index.fillna(0).astype(int)
    return A
At runtime, get positional boundaries from C
and construct a list of all positional indexes to search for in B
, and pass them to B.take()
:
def get_slice(B, C):
    all_indexes = []
    for start_index, end_index in zip(
            C.start_index.values, C.end_index.values):
        all_indexes.extend(range(start_index, end_index))
    return B.take(all_indexes)
I hope it's not too complicated. Essentially, the idea is: for each row in A, store the range of corresponding (positional) indexes of rows in B, so that at runtime we can quickly construct the list of all positional indexes to query B by.
This is a toy example:
A = pd.DataFrame(range(3), columns=['dataA'], index=['A0', 'A1', 'A2'])
print(A)
dataA
A0 0
A1 1
A2 2
mindex = pd.MultiIndex.from_tuples([
    ('A0', 'B0'), ('A0', 'B1'), ('A1', 'B0'),
    ('A2', 'B0'), ('A2', 'B1'), ('A2', 'B3')])
B = pd.DataFrame(range(6), columns=['dataB'], index=mindex)
print(B)
dataB
A0 B0 0
B1 1
A1 B0 2
A2 B0 3
B1 4
B3 5
A = construct_position_indexes(A, B)
print(A)
dataA start_index end_index
A0 0 0 2
A1 1 2 3
A2 2 3 6
C = A.iloc[[0, 2], :]
print(C)
dataA start_index end_index
A0 0 0 2
A2 2 3 6
print(get_slice(B, C))
dataB
A0 B0 0
B1 1
A2 B0 3
B1 4
B3 5
Upvotes: 0
Reputation: 10162
If you transform your index 'a' back to a column, you can do it as follows:
>>> df = pd.DataFrame({'a':[1,1,1,2,2,2,3,3,3],
'b':[1,2,3,1,2,3,1,2,3],
'c':range(9)})
>>> filt = pd.Series({1:True, 2:False, 3:True})
>>> df[filt[df['a']].values]
a b c
0 1 1 0
1 1 2 1
2 1 3 2
6 3 1 6
7 3 2 7
8 3 3 8
Edit: as suggested by @joris, this also works with indexes. Here is the code for your sample data:
>>> df[filt[df.index.get_level_values('a')].values]
c
a b
1 1 0
2 1
3 2
3 1 6
2 7
3 8
Upvotes: 8
Reputation: 139142
If the boolean series is not aligned with the dataframe you want to index it with, you can first explicitly align it with align:
In [25]: df_aligned, filt_aligned = df.align(filt.to_frame(), level=0, axis=0)
In [26]: filt_aligned
Out[26]:
0
a b
1 1 True
2 True
3 True
2 1 False
2 False
3 False
3 1 True
2 True
3 True
And then you can index with it:
In [27]: df[filt_aligned[0]]
Out[27]:
c
a b
1 1 0
2 1
3 2
3 1 6
2 7
3 8
Note: align doesn't work with a Series, hence the to_frame in the align call, and hence the [0] above to get the Series back.
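For copy-paste convenience, the same steps as a self-contained script (recreating the question's df and filt):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'b': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'c': range(9)}).set_index(['a', 'b'])
filt = pd.Series({1: True, 2: False, 3: True})

# align broadcasts the one-column frame along level 0 of df's index;
# column 0 of the aligned frame is the boolean mask we need
_, filt_aligned = df.align(filt.to_frame(), level=0, axis=0)
result = df[filt_aligned[0]]
```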
Upvotes: 3