Reputation: 145
I have a df with multi-indexed columns, like this:
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data
I want to be able to select all rows where the values in one of the level 1 columns pass a certain test. If there were no multi-index on the columns I would say something like:
data[data['d']<1]
But of course that fails on a multindex. The level 1 indexes are unique, so I don't want to have to specify the level 0 index, just level 1. I'd like to return the table above but missing row 1, where d>1.
Upvotes: 5
Views: 346
Reputation: 28729
Using your supplied data, a combination of xs and squeeze can help with the filtering. This works on the assumption that the level 1 entries are unique, as indicated in your question :
np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data
one two
a b c d e f
0 -0.217679 0.821455 1.481278 1.331864 -0.361865 0.685609
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
3 0.845930 -0.503542 -0.963336 0.064969 -3.205040 1.054969
Say you want to filter for d less than 1 :
#squeeze turns it into a series, making it easy to pass to loc via boolean indexing
condition = data.xs('d',axis=1,level=1).lt(1).squeeze()
#or you could use loc :
# condition = data.loc(axis=1)[:,'d'].lt(1).squeeze()
data.loc[condition]
one two
a b c d e f
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
3 0.845930 -0.503542 -0.963336 0.064969 -3.205040 1.054969
Upvotes: 2
Reputation: 145
Thanks everyone for your help. As usual with these things, the specific answer to the problem is not as interesting as what you've learned in trying to fix it, and I learned a lot about .query
, .xs
and much more.
However, I ended up taking a side route to addressing my specific issue - namely that I copied the columns to a new variable, dropped an index, did my calculations, then put the original indexes in place. Eg:
cols = data.columns
data..droplevel(level=1, axis=1)
# do calculations
data.columns = cols
The advantage was that I could top and tail the operation modifying the indexes, but all the data manipulation in between used idioms I'm familiar with.
At some point I'll sit down and read about multi-indexes at length.
Upvotes: 0
Reputation: 863531
If values are unique in second level id necessary convert mask from one column DataFrame to Series
- possible solution with DataFrame.squeeze
:
np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
print (data.xs('d', axis=1, level=1))
two
0 1.331864
1 0.953490
2 -0.189313
3 0.064969
print (data.xs('d', axis=1, level=1).squeeze())
0 1.331864
1 0.953490
2 -0.189313
3 0.064969
Name: two, dtype: float64
print (data.xs('d', axis=1, level=1).squeeze().lt(1))
0 False
1 True
2 True
3 True
Name: two, dtype: bool
df = data[data.xs('d', axis=1, level=1).squeeze().lt(1)]
Alternative with DataFrame.iloc
:
df = data[data.xs('d', axis=1, level=1).iloc[:, 0].lt(1)]
print (df)
one two
a b c d e f
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
3 0.845930 -0.503542 -0.963336 0.064969 -3.205040 1.054969
If working with MultiIndex
after select is possible get multiple columns, like here if select by c
level:
np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
So first select by DataFrame.xs
and compare by DataFrame.lt
for <
print (data.xs('c', axis=1, level=1))
two
0 1.481278 0.685609
1 -0.235634 -0.344943
2 0.244509 0.464802
3 -0.963336 1.054969
m = data.xs('c', axis=1, level=1).lt(1)
#alternative
#m = data.xs('c', axis=1, level=1) < 1
print (m)
two
0 False True
1 True True
2 True True
3 True False
And then test if at least one True
per rows by DataFrame.any
and filter by boolean indexing
:
df1 = data[m.any(axis=1)]
print (df1)
one two
a b c a b c
0 -0.217679 0.821455 1.481278 1.331864 -0.361865 0.685609
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
3 0.845930 -0.503542 -0.963336 0.064969 -3.205040 1.054969
Or test if all True
s per row by DataFrame.any
with filtering:
df1 = data[m.all(axis=1)]
print (df1)
one two
a b c a b c
1 0.573761 0.287728 -0.235634 0.953490 -1.689625 -0.344943
2 0.016905 -0.514984 0.244509 -0.189313 2.672172 0.464802
Upvotes: 3
Reputation: 20786
I think this can be done using query
;
data.query("some_column <1")
and get_level_values
data[data.index.get_level_values('some_column') < 1]
Upvotes: 1