germ

Reputation: 1689

Multiple random selection from MultiIndex

Consider the following DataFrame:

import numpy as np
import pandas as pd

arrays = [['A','A','B','B','C','C'],
          [1,1,3,3,5,5],
          [2,2,4,4,6,6],
          [0.1,0.2,0.3,0.4,0.5,0.6]]
index = pd.MultiIndex.from_arrays(arrays, names=('Sample','P1','P2','T'))
data = np.random.rand(10, 6)
df = pd.DataFrame(columns=index, data=data)

I want to select: for sample A, the column with T=0.2, and for sample C, the column with T=0.5.

I can easily select each of the single columns, e.g.:

df.loc[:,('A',slice(None),slice(None),0.2)]  # or
df.loc(axis=1)[('C',slice(None),slice(None),0.5)] 

But how can I combine them? I tried supplying a list of tuples:

df.loc[:,[('A',slice(None),slice(None),0.2),('C',slice(None),slice(None),0.5)]]

But that generates an error.

How can I select my columns without resorting to pd.concat?

Upvotes: 1

Views: 59

Answers (2)

sammywemmy

Reputation: 28729

@QuangHoang has suggested an easy way with query in the comments (timed below). Another option, using a third-party library, pyjanitor, allows multiple selections on a MultiIndex with dictionaries via the select method (disclosure: I am a contributor to the library):

# pip install pyjanitor
import pandas as pd
import janitor

df.select(columns=[{'Sample':'A', 'T':0.2}, {'Sample':'C', 'T':0.5}])

Sample         A         C
P1             1         5
P2             2         6
T            0.2       0.5
0       0.240285  0.966722
1       0.537845  0.605961
2       0.581185  0.680213
3       0.504225  0.158793
4       0.673914  0.020176
5       0.528920  0.616641
6       0.189066  0.612055
7       0.766254  0.161062
8       0.077120  0.593749
9       0.922863  0.217441

As shown in the example code above, each dictionary key is a level in the MultiIndex, and the corresponding value is the label you wish to select within that level.

Another option is to build a boolean mask over the columns and extract the matching entries before selecting with loc. The advantage of this is clarity: you know exactly what you are selecting, and you have more control:

cond1 = (df.columns.get_level_values('Sample') == 'A') & (df.columns.get_level_values('T') == 0.2)
cond2 = (df.columns.get_level_values('Sample') == 'C') & (df.columns.get_level_values('T') == 0.5)
indexer = df.columns[cond1 | cond2]
indexer
MultiIndex([('A', 1, 2, 0.2),
            ('C', 5, 6, 0.5)],
           names=['Sample', 'P1', 'P2', 'T'])
df.loc[:, indexer]
Sample         A         C
P1             1         5
P2             2         6
T            0.2       0.5
0       0.240285  0.966722
1       0.537845  0.605961
2       0.581185  0.680213
3       0.504225  0.158793
4       0.673914  0.020176
5       0.528920  0.616641
6       0.189066  0.612055
7       0.766254  0.161062
8       0.077120  0.593749
9       0.922863  0.217441
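
If you use this pattern often, the mask building can be wrapped in a small helper. The function below is a hypothetical convenience (not part of pandas), sketched only to generalize the cond1/cond2 example above:

import numpy as np

def select_by_levels(df, specs):
    # specs is a list of {level_name: label} dicts; conditions within a
    # dict are ANDed, and the per-dict masks are ORed together
    mask = np.zeros(df.shape[1], dtype=bool)
    for spec in specs:
        cond = np.ones(df.shape[1], dtype=bool)
        for level, label in spec.items():
            cond &= df.columns.get_level_values(level) == label
        mask |= cond
    return df.loc[:, mask]

# equivalent to the two-condition example above
select_by_levels(df, [{'Sample': 'A', 'T': 0.2}, {'Sample': 'C', 'T': 0.5}])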

Another option is a combination of pandas' xs method and the concat function; note that this still relies on pd.concat, which the question hoped to avoid (the output below was captured after enlarging the dataframe for the timings further down):

pd.concat([df.xs(key=0.2, level='T', axis=1, drop_level=False),
           df.xs(key=0.5, level='T', axis=1, drop_level=False)],
          axis=1)
Out[9]:
Sample         A         C
P1             1         5
P2             2         6
T            0.2       0.5
0       0.427995  0.257755
1       0.520026  0.370271
2       0.280932  0.402916
3       0.901517  0.337662
4       0.179267  0.399471
...          ...       ...
199995  0.620268  0.215945
199996  0.420360  0.544633
199997  0.793744  0.153501
199998  0.253700  0.588854
199999  0.201630  0.177926

[200000 rows x 2 columns]
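
Since the question specifically asked to avoid pd.concat: in this particular frame the T values alone identify the target columns, so a single isin mask on the T level would also work. This is only a sketch and assumes T is unique across samples:

# works here because T=0.2 and T=0.5 each occur in exactly one column
df.loc[:, df.columns.get_level_values('T').isin([0.2, 0.5])]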

A look at the performance of the various solutions:

# let's increase size of the dataframe
In [54]: df = pd.concat([df]*20_000, ignore_index=True)

In [55]: df.shape
Out[55]: (200000, 6)
# manual form
In [56]: %%timeit
    ...: cond1=(df.columns.get_level_values('Sample')=='A') & (df.columns.get_level_values('T')==0.2)
    ...: cond2=(df.columns.get_level_values('Sample')=='C') & (df.columns.get_level_values('T')==0.5)
    ...: indexer=df.columns[cond1|cond2]
    ...: df.loc[:, indexer]
    ...:
    ...:
680 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# third party library - pyjanitor
In [57]: %timeit df.select(columns=[{'Sample':'A', 'T':0.2}, {'Sample':'C', 'T':0.5}])
603 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# panda kim's solution
In [58]: %timeit df.loc[:, df.columns.droplevel([1, 2]).isin([('A', 0.2), ('C', 0.5)])]
753 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# xs + concat
%timeit pd.concat([df.xs(key=0.2, level='T', axis=1, drop_level=False), df.xs(key=0.5,level='T',axis=1, drop_level=False)],axis=1)
1.03 ms ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# query seems expensive - not sure why (possibly the double transpose of the full frame)
In [59]: %timeit df.T.query('(Sample == "A" & T==0.2) or (Sample == "C" & T==0.5)').T
109 ms ± 634 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
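
One way to keep the query idea while avoiding the two full-frame transposes is to run query against the column index converted to a frame instead. A sketch, not benchmarked here:

# query the 6-row column index, not the 200,000-row data
cols = df.columns.to_frame(index=False).query(
    '(Sample == "A" & T == 0.2) or (Sample == "C" & T == 0.5)'
)
df.loc[:, pd.MultiIndex.from_frame(cols)]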

Upvotes: 2

Panda Kim

Reputation: 13257

Use boolean indexing:

out = df.loc[:, df.columns.droplevel([1, 2]).isin([('A', 0.2), ('C', 0.5)])]

out:

Sample         A         C
P1             1         5
P2             2         6
T            0.2       0.5
0       0.836079  0.368242
1       0.870087  0.520477
2       0.582020  0.105908
3       0.736918  0.324141
4       0.386489  0.613063
5       0.969809  0.358152
6       0.325047  0.958949
7       0.995300  0.474698
8       0.674752  0.949571
9       0.622846  0.878193
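
For context, droplevel([1, 2]) removes the P1 and P2 levels, leaving a (Sample, T) MultiIndex whose entries can be matched as tuples. With the question's setup:

df.columns.droplevel([1, 2])
# MultiIndex([('A', 0.1), ('A', 0.2), ('B', 0.3),
#             ('B', 0.4), ('C', 0.5), ('C', 0.6)],
#            names=['Sample', 'T'])
df.columns.droplevel([1, 2]).isin([('A', 0.2), ('C', 0.5)])
# array([False,  True, False, False,  True, False])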

Upvotes: 2
