harijay
harijay

Reputation: 11883

Pandas how to get a list of rows that have multiple values for an index level in a Multiindex DataFrame

I got this Dataframe from a pivot operation and I dont know how to handle "nested" or multiindex Dataframes in pandas.

The Dataframe looks like this example below, only there are many more rows than shown here. [edit: Added an additional "chr18" row to give a more illustrative example. This too needs to be filtered out]

mmc chrom   start     stop      experiment  isdone  strand  countL countR
3   chr18   2044696   2044716   hj-10_b_10  FALSE   -        12     12
            2060000   2061000   hj-10_b_10  FALSE   -        162    162
    chr3    95359191  95359212  hj-10_b_10  FALSE   -        2497   2497
                                hj-9_b_9    TRUE    -        3476   3477
                                hj1_100_3   TRUE    -        2351   2351
4   chr19   598940    598961    hj-10_b_10  FALSE   -        494    494
                                hj1_100_3*1 TRUE    -        211    211

I want to filter out from this DataFrame all of the "chrom" entries that have multiple entries in the experiment level , i.e select all the chrom and start , stop column that have multiple entries in the experiment index level.

Result Dataframe that I want ( note it does not have mmc : 3 chrom : 18 entries since the two entries have only one experiment "hj-10_b_10" each and so were not reproduced multiple times).

mmc chrom   start     stop      experiment  isdone  strand  countL countR
 3  chr3    95359191  95359212  hj-10_b_10  FALSE   -        2497   2497
                                hj-9_b_9    TRUE    -        3476   3477
                                hj1_100_3   TRUE    -        2351   2351
 4  chr19   598940    598961    hj-10_b_10  FALSE   -        494    494
                                hj1_100_3*1 TRUE    -        211    211

I could possibly do this outside of pandas but since I want to learn the pandas way.

How can I select from a massive Dataframe all of the entries that are exceed a certain count for a particular index level.

Update

You can use this code to create the MultiIndex DataFrame

import pandas
from pandas import DataFrame
index_tuples_mmc= [3,3,3,3,3,4,4]
index_tuples_chrom = ["chr18","chr18","chr3","chr3","chr3","chr19","chr19"]
index_tuples_start = ["2044696","2060000","95359191","95359191","95359191","598940","598940"]
index_tuples_stop = ["2044716" ,"2061000","95359212", "95359212" , "95359212" ,"598961" , "598961"]
index_tuples_experiment = ["hj-10_b_10","hj-10_b_10","hj-10_b_10","hj-9_b_9","hj1_100_3","hj-10_b_10","hj1_100_3*1"]
index_tuples_idone = ["FALSE","FALSE","FALSE","TRUE","TRUE","FALSE","TRUE"]
index_tuples_strand = ["-","-","-","-","-","-","-"]
arrays = [index_tuples_mmc,index_tuples_chrom,index_tuples_start,\
         index_tuples_stop,index_tuples_experiment,index_tuples_idone,\
         index_tuples_strand]
tuples = list(zip(*arrays))
index = pandas.MultiIndex.from_tuples(tuples,names=["mmc","chrom",\
                                                "start","stop","experiment","isdone",\
                                                   "strand"])
df2 = DataFrame([12,162,2497,3476,2351,494,211],index=index,columns=["countL"])
df2["countR"]=df2["countL"]

Upvotes: 2

Views: 274

Answers (2)

ye jiawei
ye jiawei

Reputation: 902

You can try this :

idx=pd.IndexSlice
df.loc[idx[:,['chr3','chr19'],:,:,:,:,:,],:]

want to learn more about MultiIndex / Advanced Indexing, look here http://pandas.pydata.org/pandas-docs/stable/advanced.html

Upvotes: 0

jezrael
jezrael

Reputation: 863166

I think you can use groupby by second level of MultiIndex chrom with filter:

print df
                              experiment isdone strand  countL  countR
mmc chrom start    stop                                               
3   chr18 2044696  2044716    hj-10_b_10  False      -      12      12
    chr3  95359191 95359212   hj-10_b_10  False      -    2497    2497
                   95359212     hj-9_b_9   True      -    3476    3477
                   95359212    hj1_100_3   True      -    2351    2351
4   chr19 598940   598961     hj-10_b_10  False      -     494     494
                   598961    hj1_100_3*1   True      -     211     211

print df.groupby(level=['chrom']).filter(lambda x: len(x) > 1)
                              experiment isdone strand  countL  countR
mmc chrom start    stop                                               
3   chr3  95359191 95359212   hj-10_b_10  False      -    2497    2497
                   95359212     hj-9_b_9   True      -    3476    3477
                   95359212    hj1_100_3   True      -    2351    2351
4   chr19 598940   598961     hj-10_b_10  False      -     494     494
                   598961    hj1_100_3*1   True      -     211     211

EDIT by new data:

I think you need add to groupby first and second level, so:

import pandas

dfnew= pandas.read_excel("sample_data_hj.xls")

dfpivotnew = pandas.pivot_table(dfnew, index=["mmc","chrom","start","end","experiment","is_bidirectional"],\
                   values=["segment_quality_reads","segment_reads",])

print dfpivotnew.groupby(level=['mmc','chrom']).filter(lambda y : len(y) > 3)
mmc chrom start end   experiment  is_bidirectional                          
0   chr2  13924 14585 hj-10_b_10  True                             635103   
          29212 29790 hj-9_b_9    True                             475119   
          61877 62241 hj1_104_3*1 True                              60455   
          68398 68852 hj1_104_3*2 True                              18625   
1   chr15 1578  1906  hj-10_b_10  True                              21702   
          4554  5155  hj-9_b_9    True                              26673   
          37960 38789 hj1_104_3*2 True                                753   
          88306 88800 hj1_104_3*1 True                               2967   
6   chr1  65952 66932 hj1_104_3*2 False                                 1   
          66084 67057 hj-9_b_9    False                                 1   
          77010 77843 hj-9_b_9    False                               262   
          86297 87142 hj1_104_3*1 False                                 3   
    chr2  29084 29705 hj1_104_3*2 False                              1756   
          39924 40755 hj1_104_3*2 False                                 1   
          90411 90940 hj1_104_3*1 False                              2656   
          92804 93655 hj1_104_3*2 False                              2215   
7   chr1  647   1092  hj1_104_3*2 False                               217   
          6598  7093  hj-10_b_10  False                              1454   
          7998  8475  hj1_104_3*2 False                                11   
          13056 13879 hj1_104_3*1 False                                42   
          24918 25781 hj1_104_3*2 False                                13   
          28406 29135 hj1_104_3*2 False                              7140   
          28464 29459 hj-10_b_10  False                              1298   
          34484 35221 hj1_104_3*2 False                               769   
          38353 39040 hj-9_b_9    False                               128   
          42462 43186 hj1_104_3*2 False                                 2   
          50676 51317 hj1_104_3*1 False                                 7   
          51434 52271 hj1_104_3*2 False                                13   
          53634 54486 hj1_104_3*2 False                                 6   
          54574 55350 hj1_104_3*1 False                               224   
...                                                                   ...   
16  chr8  52118 52641 hj1_104_3*2 False                              2198   
          52469 52827 hj-9_b_9    False                                93   
          62446 62869 hj1_104_3*1 False                              2632   
          63765 64709 hj-9_b_9    False                               480   
          94773 95603 hj1_104_3*2 False                                 6   
          95316 95740 hj1_104_3*2 False                                69   
    chr9  7442  7875  hj1_104_3*2 False                                56   
          17897 18332 hj-9_b_9    False                                 2   
          33998 34656 hj1_104_3*2 False                                 1   
          77111 77690 hj1_104_3*2 False                              1198   
          88270 88715 hj-9_b_9    False                                 1   
          93281 94063 hj1_104_3*1 False                               837   
17  chr10 56436 56997 hj1_104_3*2 False                                 2   
          61154 61464 hj1_104_3*1 False                                 1   
          70244 71170 hj1_104_3*1 False                                 2   
          96948 97775 hj-9_b_9    False                                 4   
    chr3  868   1837  hj1_104_3*1 False                                 2   
          53355 53743 hj-9_b_9    False                              1063   
          54418 54858 hj1_104_3*2 False                               499   
          84641 85546 hj1_104_3*1 False                                 1   
    chr6  22374 22924 hj1_104_3*2 False                                 6   
          27347 27928 hj1_104_3*2 False                                71   
          58950 59394 hj1_104_3*2 False                              1712   
          73141 73481 hj1_104_3*2 False                              1487   
    chr8  10611 11483 hj-9_b_9    False                               196   
          11139 11799 hj1_104_3*2 False                               479   
          36694 37343 hj1_104_3*2 False                               254   
          56915 57742 hj1_104_3*2 False                                76   
          62405 63164 hj1_104_3*2 False                               182   
          72352 72823 hj1_104_3*2 False                                 1   

                                                    segment_reads  
mmc chrom start end   experiment  is_bidirectional                 
0   chr2  13924 14585 hj-10_b_10  True                     642400  
          29212 29790 hj-9_b_9    True                     486464  
          61877 62241 hj1_104_3*1 True                      61237  
          68398 68852 hj1_104_3*2 True                      22056  
1   chr15 1578  1906  hj-10_b_10  True                      21732  
          4554  5155  hj-9_b_9    True                      26676  
          37960 38789 hj1_104_3*2 True                        753  
          88306 88800 hj1_104_3*1 True                       2969  
6   chr1  65952 66932 hj1_104_3*2 False                         1  
          66084 67057 hj-9_b_9    False                         1  
          77010 77843 hj-9_b_9    False                       716  
          86297 87142 hj1_104_3*1 False                         3  
    chr2  29084 29705 hj1_104_3*2 False                      1756  
          39924 40755 hj1_104_3*2 False                         1  
          90411 90940 hj1_104_3*1 False                      2656  
          92804 93655 hj1_104_3*2 False                      2215  
7   chr1  647   1092  hj1_104_3*2 False                       217  
          6598  7093  hj-10_b_10  False                      1454  
          7998  8475  hj1_104_3*2 False                        11  
          13056 13879 hj1_104_3*1 False                        78  
          24918 25781 hj1_104_3*2 False                        18  
          28406 29135 hj1_104_3*2 False                      7147  
          28464 29459 hj-10_b_10  False                      1298  
          34484 35221 hj1_104_3*2 False                       769  
          38353 39040 hj-9_b_9    False                       713  
          42462 43186 hj1_104_3*2 False                         3  
          50676 51317 hj1_104_3*1 False                         7  
          51434 52271 hj1_104_3*2 False                       729  
          53634 54486 hj1_104_3*2 False                         6  
          54574 55350 hj1_104_3*1 False                       224  
...                                                           ...  
16  chr8  52118 52641 hj1_104_3*2 False                      2198  
          52469 52827 hj-9_b_9    False                        93  
          62446 62869 hj1_104_3*1 False                      2632  
          63765 64709 hj-9_b_9    False                       480  
          94773 95603 hj1_104_3*2 False                         6  
          95316 95740 hj1_104_3*2 False                        69  
    chr9  7442  7875  hj1_104_3*2 False                        56  
          17897 18332 hj-9_b_9    False                         2  
          33998 34656 hj1_104_3*2 False                       192  
          77111 77690 hj1_104_3*2 False                      1198  
          88270 88715 hj-9_b_9    False                         1  
          93281 94063 hj1_104_3*1 False                       837  
17  chr10 56436 56997 hj1_104_3*2 False                         2  
          61154 61464 hj1_104_3*1 False                         1  
          70244 71170 hj1_104_3*1 False                        12  
          96948 97775 hj-9_b_9    False                         4  
    chr3  868   1837  hj1_104_3*1 False                         2  
          53355 53743 hj-9_b_9    False                      1063  
          54418 54858 hj1_104_3*2 False                       499  
          84641 85546 hj1_104_3*1 False                        24  
    chr6  22374 22924 hj1_104_3*2 False                         6  
          27347 27928 hj1_104_3*2 False                        71  
          58950 59394 hj1_104_3*2 False                      1712  
          73141 73481 hj1_104_3*2 False                      1487  
    chr8  10611 11483 hj-9_b_9    False                       238  
          11139 11799 hj1_104_3*2 False                       479  
          36694 37343 hj1_104_3*2 False                       254  
          56915 57742 hj1_104_3*2 False                        76  
          62405 63164 hj1_104_3*2 False                       182  
          72352 72823 hj1_104_3*2 False                         1  

[10330 rows x 2 columns]

Upvotes: 1

Related Questions