Reputation: 145

Selecting rows with multi-index columns

I have a df with multi-indexed columns, like this:

col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data

I want to be able to select all rows where the values in one of the level 1 columns pass a certain test. If there were no multi-index on the columns I would say something like:

data[data['d']<1]

But of course that fails on a multindex. The level 1 indexes are unique, so I don't want to have to specify the level 0 index, just level 1. I'd like to return the table above but missing row 1, where d>1.

Upvotes: 5

Answers (4)

sammywemmy

Reputation: 28729

Using your supplied data, a combination of xs and squeeze can help with the filtering. This works on the assumption that the level 1 entries are unique, as indicated in your question :

np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)
data

       one                                         two
         a          b         c            d            e          f
0   -0.217679   0.821455    1.481278    1.331864    -0.361865   0.685609
1   0.573761    0.287728    -0.235634   0.953490    -1.689625   -0.344943
2   0.016905    -0.514984   0.244509    -0.189313   2.672172    0.464802
3   0.845930    -0.503542   -0.963336   0.064969    -3.205040   1.054969

Say you want to filter for d less than 1 :

#squeeze turns it into a series, making it easy to pass to loc via boolean indexing
condition = data.xs('d',axis=1,level=1).lt(1).squeeze()
#or you could use loc : 
# condition = data.loc(axis=1)[:,'d'].lt(1).squeeze()

data.loc[condition]

        one                                              two
       a             b           c         d            e           f
1   0.573761    0.287728    -0.235634   0.953490    -1.689625   -0.344943
2   0.016905    -0.514984   0.244509    -0.189313   2.672172    0.464802
3   0.845930    -0.503542   -0.963336   0.064969    -3.205040   1.054969

Upvotes: 2

Andy Wilson

Reputation: 145

Thanks everyone for your help. As usual with these things, the specific answer to the problem is not as interesting as what you've learned in trying to fix it, and I learned a lot about .query, .xs and much more.

However, I ended up taking a side route to addressing my specific issue - namely that I copied the columns to a new variable, dropped an index, did my calculations, then put the original indexes in place. Eg:

cols = data.columns
data..droplevel(level=1, axis=1)
# do calculations
data.columns = cols

The advantage was that I could top and tail the operation modifying the indexes, but all the data manipulation in between used idioms I'm familiar with.

At some point I'll sit down and read about multi-indexes at length.

Upvotes: 0

jezrael

Reputation: 863531

If values are unique in second level id necessary convert mask from one column DataFrame to Series - possible solution with DataFrame.squeeze:

np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'd', 'e', 'f']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)


print (data.xs('d', axis=1, level=1))
        two
0  1.331864
1  0.953490
2 -0.189313
3  0.064969

print (data.xs('d', axis=1, level=1).squeeze())
0    1.331864
1    0.953490
2   -0.189313
3    0.064969
Name: two, dtype: float64

print (data.xs('d', axis=1, level=1).squeeze().lt(1))
0    False
1     True
2     True
3     True
Name: two, dtype: bool

df = data[data.xs('d', axis=1, level=1).squeeze().lt(1)]

Alternative with DataFrame.iloc:

df = data[data.xs('d', axis=1, level=1).iloc[:, 0].lt(1)]

print (df)
        one                           two                    
          a         b         c         d         e         f
1  0.573761  0.287728 -0.235634  0.953490 -1.689625 -0.344943
2  0.016905 -0.514984  0.244509 -0.189313  2.672172  0.464802
3  0.845930 -0.503542 -0.963336  0.064969 -3.205040  1.054969

If working with MultiIndex after select is possible get multiple columns, like here if select by c level:

np.random.seed(2019)
col = pd.MultiIndex.from_arrays([['one', '', '', 'two', 'two', 'two'],
                                ['a', 'b', 'c', 'a', 'b', 'c']])
data = pd.DataFrame(np.random.randn(4, 6), columns=col)

So first select by DataFrame.xs and compare by DataFrame.lt for <

print (data.xs('c', axis=1, level=1))
                  two
0  1.481278  0.685609
1 -0.235634 -0.344943
2  0.244509  0.464802
3 -0.963336  1.054969

m = data.xs('c', axis=1, level=1).lt(1)
#alternative
#m = data.xs('c', axis=1, level=1) < 1
print (m)
            two
0  False   True
1   True   True
2   True   True
3   True  False

And then test if at least one True per rows by DataFrame.any and filter by boolean indexing:

df1 = data[m.any(axis=1)]
print (df1)
        one                           two                    
          a         b         c         a         b         c
0 -0.217679  0.821455  1.481278  1.331864 -0.361865  0.685609
1  0.573761  0.287728 -0.235634  0.953490 -1.689625 -0.344943
2  0.016905 -0.514984  0.244509 -0.189313  2.672172  0.464802
3  0.845930 -0.503542 -0.963336  0.064969 -3.205040  1.054969

Or test if all Trues per row by DataFrame.any with filtering:

df1 = data[m.all(axis=1)]
print (df1)
        one                           two                    
          a         b         c         a         b         c
1  0.573761  0.287728 -0.235634  0.953490 -1.689625 -0.344943
2  0.016905 -0.514984  0.244509 -0.189313  2.672172  0.464802

Upvotes: 3

Yagiz Degirmenci

Reputation: 20786

I think this can be done using query;

data.query("some_column <1")

and get_level_values

data[data.index.get_level_values('some_column') < 1]

Upvotes: 1

Selecting rows with multi-index columns

Answers (4)

Related Questions