Reputation: 11883
I got this Dataframe from a pivot operation and I dont know how to handle "nested" or multiindex Dataframes in pandas.
The Dataframe looks like this example below, only there are many more rows than shown here. [edit: Added an additional "chr18" row to give a more illustrative example. This too needs to be filtered out]
mmc chrom start stop experiment isdone strand countL countR
3 chr18 2044696 2044716 hj-10_b_10 FALSE - 12 12
2060000 2061000 hj-10_b_10 FALSE - 162 162
chr3 95359191 95359212 hj-10_b_10 FALSE - 2497 2497
hj-9_b_9 TRUE - 3476 3477
hj1_100_3 TRUE - 2351 2351
4 chr19 598940 598961 hj-10_b_10 FALSE - 494 494
hj1_100_3*1 TRUE - 211 211
I want to filter out from this DataFrame all of the "chrom" entries that have multiple entries in the experiment level , i.e select all the chrom and start , stop column that have multiple entries in the experiment index level.
Result Dataframe that I want ( note it does not have mmc : 3 chrom : 18 entries since the two entries have only one experiment "hj-10_b_10" each and so were not reproduced multiple times).
mmc chrom start stop experiment isdone strand countL countR
3 chr3 95359191 95359212 hj-10_b_10 FALSE - 2497 2497
hj-9_b_9 TRUE - 3476 3477
hj1_100_3 TRUE - 2351 2351
4 chr19 598940 598961 hj-10_b_10 FALSE - 494 494
hj1_100_3*1 TRUE - 211 211
I could possibly do this outside of pandas but since I want to learn the pandas way.
How can I select from a massive Dataframe all of the entries that are exceed a certain count for a particular index level.
Update
You can use this code to create the MultiIndex DataFrame
import pandas
from pandas import DataFrame
index_tuples_mmc= [3,3,3,3,3,4,4]
index_tuples_chrom = ["chr18","chr18","chr3","chr3","chr3","chr19","chr19"]
index_tuples_start = ["2044696","2060000","95359191","95359191","95359191","598940","598940"]
index_tuples_stop = ["2044716" ,"2061000","95359212", "95359212" , "95359212" ,"598961" , "598961"]
index_tuples_experiment = ["hj-10_b_10","hj-10_b_10","hj-10_b_10","hj-9_b_9","hj1_100_3","hj-10_b_10","hj1_100_3*1"]
index_tuples_idone = ["FALSE","FALSE","FALSE","TRUE","TRUE","FALSE","TRUE"]
index_tuples_strand = ["-","-","-","-","-","-","-"]
arrays = [index_tuples_mmc,index_tuples_chrom,index_tuples_start,\
index_tuples_stop,index_tuples_experiment,index_tuples_idone,\
index_tuples_strand]
tuples = list(zip(*arrays))
index = pandas.MultiIndex.from_tuples(tuples,names=["mmc","chrom",\
"start","stop","experiment","isdone",\
"strand"])
df2 = DataFrame([12,162,2497,3476,2351,494,211],index=index,columns=["countL"])
df2["countR"]=df2["countL"]
Upvotes: 2
Views: 274
Reputation: 902
You can try this :
idx=pd.IndexSlice
df.loc[idx[:,['chr3','chr19'],:,:,:,:,:,],:]
want to learn more about MultiIndex / Advanced Indexing, look here http://pandas.pydata.org/pandas-docs/stable/advanced.html
Upvotes: 0
Reputation: 863166
I think you can use groupby
by second level of MultiIndex
chrom
with filter
:
print df
experiment isdone strand countL countR
mmc chrom start stop
3 chr18 2044696 2044716 hj-10_b_10 False - 12 12
chr3 95359191 95359212 hj-10_b_10 False - 2497 2497
95359212 hj-9_b_9 True - 3476 3477
95359212 hj1_100_3 True - 2351 2351
4 chr19 598940 598961 hj-10_b_10 False - 494 494
598961 hj1_100_3*1 True - 211 211
print df.groupby(level=['chrom']).filter(lambda x: len(x) > 1)
experiment isdone strand countL countR
mmc chrom start stop
3 chr3 95359191 95359212 hj-10_b_10 False - 2497 2497
95359212 hj-9_b_9 True - 3476 3477
95359212 hj1_100_3 True - 2351 2351
4 chr19 598940 598961 hj-10_b_10 False - 494 494
598961 hj1_100_3*1 True - 211 211
EDIT by new data:
I think you need add to groupby
first and second level, so:
import pandas
dfnew= pandas.read_excel("sample_data_hj.xls")
dfpivotnew = pandas.pivot_table(dfnew, index=["mmc","chrom","start","end","experiment","is_bidirectional"],\
values=["segment_quality_reads","segment_reads",])
print dfpivotnew.groupby(level=['mmc','chrom']).filter(lambda y : len(y) > 3)
mmc chrom start end experiment is_bidirectional
0 chr2 13924 14585 hj-10_b_10 True 635103
29212 29790 hj-9_b_9 True 475119
61877 62241 hj1_104_3*1 True 60455
68398 68852 hj1_104_3*2 True 18625
1 chr15 1578 1906 hj-10_b_10 True 21702
4554 5155 hj-9_b_9 True 26673
37960 38789 hj1_104_3*2 True 753
88306 88800 hj1_104_3*1 True 2967
6 chr1 65952 66932 hj1_104_3*2 False 1
66084 67057 hj-9_b_9 False 1
77010 77843 hj-9_b_9 False 262
86297 87142 hj1_104_3*1 False 3
chr2 29084 29705 hj1_104_3*2 False 1756
39924 40755 hj1_104_3*2 False 1
90411 90940 hj1_104_3*1 False 2656
92804 93655 hj1_104_3*2 False 2215
7 chr1 647 1092 hj1_104_3*2 False 217
6598 7093 hj-10_b_10 False 1454
7998 8475 hj1_104_3*2 False 11
13056 13879 hj1_104_3*1 False 42
24918 25781 hj1_104_3*2 False 13
28406 29135 hj1_104_3*2 False 7140
28464 29459 hj-10_b_10 False 1298
34484 35221 hj1_104_3*2 False 769
38353 39040 hj-9_b_9 False 128
42462 43186 hj1_104_3*2 False 2
50676 51317 hj1_104_3*1 False 7
51434 52271 hj1_104_3*2 False 13
53634 54486 hj1_104_3*2 False 6
54574 55350 hj1_104_3*1 False 224
... ...
16 chr8 52118 52641 hj1_104_3*2 False 2198
52469 52827 hj-9_b_9 False 93
62446 62869 hj1_104_3*1 False 2632
63765 64709 hj-9_b_9 False 480
94773 95603 hj1_104_3*2 False 6
95316 95740 hj1_104_3*2 False 69
chr9 7442 7875 hj1_104_3*2 False 56
17897 18332 hj-9_b_9 False 2
33998 34656 hj1_104_3*2 False 1
77111 77690 hj1_104_3*2 False 1198
88270 88715 hj-9_b_9 False 1
93281 94063 hj1_104_3*1 False 837
17 chr10 56436 56997 hj1_104_3*2 False 2
61154 61464 hj1_104_3*1 False 1
70244 71170 hj1_104_3*1 False 2
96948 97775 hj-9_b_9 False 4
chr3 868 1837 hj1_104_3*1 False 2
53355 53743 hj-9_b_9 False 1063
54418 54858 hj1_104_3*2 False 499
84641 85546 hj1_104_3*1 False 1
chr6 22374 22924 hj1_104_3*2 False 6
27347 27928 hj1_104_3*2 False 71
58950 59394 hj1_104_3*2 False 1712
73141 73481 hj1_104_3*2 False 1487
chr8 10611 11483 hj-9_b_9 False 196
11139 11799 hj1_104_3*2 False 479
36694 37343 hj1_104_3*2 False 254
56915 57742 hj1_104_3*2 False 76
62405 63164 hj1_104_3*2 False 182
72352 72823 hj1_104_3*2 False 1
segment_reads
mmc chrom start end experiment is_bidirectional
0 chr2 13924 14585 hj-10_b_10 True 642400
29212 29790 hj-9_b_9 True 486464
61877 62241 hj1_104_3*1 True 61237
68398 68852 hj1_104_3*2 True 22056
1 chr15 1578 1906 hj-10_b_10 True 21732
4554 5155 hj-9_b_9 True 26676
37960 38789 hj1_104_3*2 True 753
88306 88800 hj1_104_3*1 True 2969
6 chr1 65952 66932 hj1_104_3*2 False 1
66084 67057 hj-9_b_9 False 1
77010 77843 hj-9_b_9 False 716
86297 87142 hj1_104_3*1 False 3
chr2 29084 29705 hj1_104_3*2 False 1756
39924 40755 hj1_104_3*2 False 1
90411 90940 hj1_104_3*1 False 2656
92804 93655 hj1_104_3*2 False 2215
7 chr1 647 1092 hj1_104_3*2 False 217
6598 7093 hj-10_b_10 False 1454
7998 8475 hj1_104_3*2 False 11
13056 13879 hj1_104_3*1 False 78
24918 25781 hj1_104_3*2 False 18
28406 29135 hj1_104_3*2 False 7147
28464 29459 hj-10_b_10 False 1298
34484 35221 hj1_104_3*2 False 769
38353 39040 hj-9_b_9 False 713
42462 43186 hj1_104_3*2 False 3
50676 51317 hj1_104_3*1 False 7
51434 52271 hj1_104_3*2 False 729
53634 54486 hj1_104_3*2 False 6
54574 55350 hj1_104_3*1 False 224
... ...
16 chr8 52118 52641 hj1_104_3*2 False 2198
52469 52827 hj-9_b_9 False 93
62446 62869 hj1_104_3*1 False 2632
63765 64709 hj-9_b_9 False 480
94773 95603 hj1_104_3*2 False 6
95316 95740 hj1_104_3*2 False 69
chr9 7442 7875 hj1_104_3*2 False 56
17897 18332 hj-9_b_9 False 2
33998 34656 hj1_104_3*2 False 192
77111 77690 hj1_104_3*2 False 1198
88270 88715 hj-9_b_9 False 1
93281 94063 hj1_104_3*1 False 837
17 chr10 56436 56997 hj1_104_3*2 False 2
61154 61464 hj1_104_3*1 False 1
70244 71170 hj1_104_3*1 False 12
96948 97775 hj-9_b_9 False 4
chr3 868 1837 hj1_104_3*1 False 2
53355 53743 hj-9_b_9 False 1063
54418 54858 hj1_104_3*2 False 499
84641 85546 hj1_104_3*1 False 24
chr6 22374 22924 hj1_104_3*2 False 6
27347 27928 hj1_104_3*2 False 71
58950 59394 hj1_104_3*2 False 1712
73141 73481 hj1_104_3*2 False 1487
chr8 10611 11483 hj-9_b_9 False 238
11139 11799 hj1_104_3*2 False 479
36694 37343 hj1_104_3*2 False 254
56915 57742 hj1_104_3*2 False 76
62405 63164 hj1_104_3*2 False 182
72352 72823 hj1_104_3*2 False 1
[10330 rows x 2 columns]
Upvotes: 1