Reputation: 1167
I have a MultiIndex Pandas DataFrame that looks like the following:
import pandas as pd
import numpy as np
genotype_data = [
['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
['0/1', '200,20', 60, 99, 0.1, '0/1', '200,50', 250, 99, 0.4],
['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01]
]
genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)
info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)
df = pd.concat([info, genotype], axis=1)
I want to keep only the rows where at least one of the samples (Sample1 or Sample2 in this case) has DP >= 50 and GQ < 4. Under these conditions, all rows should be filtered out except the first row.
I have no idea where to start with this and would appreciate some help.
EDIT:
I arrived at a solution thanks to the help of jezrael's post. The code is as follows:
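# .ix was removed in pandas 1.0; .iloc skips the two INFO columns by position
genotype = df.iloc[:, 2:]
DP = genotype.xs('DP', axis=1, level=1)
GQ = genotype.xs('GQ', axis=1, level=1)
# True for rows where any sample has DP >= 50 and GQ < 4 (strict, as stated above)
conditions = (DP.ge(50) & GQ.lt(4)).any(axis=1)
df = df[conditions]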
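The same steps can be wrapped in a small helper so that the thresholds are explicit. This is only a minimal sketch: the function name filter_by_dp_gq and its min_dp and max_gq parameters are hypothetical, it uses the strict GQ < max_gq comparison from the stated condition, and the example data has one GQ value lowered to 3 so that a row actually matches.

```python
import pandas as pd

def filter_by_dp_gq(df, min_dp=50, max_gq=4):
    """Keep rows where at least one sample has DP >= min_dp and GQ < max_gq."""
    # A cross-section on the second column level gives one column per sample;
    # INFO has no DP or GQ sub-column, so it drops out on its own.
    dp = df.xs('DP', axis=1, level=1)
    gq = df.xs('GQ', axis=1, level=1)
    keep = (dp.ge(min_dp) & gq.lt(max_gq)).any(axis=1)
    return df[keep]

# Example data (assumption: GQ of Sample1 in the second row lowered from 99 to 3)
genotype_data = [
    ['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
    ['0/1', '200,20', 60, 3, 0.1, '0/1', '200,50', 250, 99, 0.4],
    ['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01],
]
genotype = pd.DataFrame(
    genotype_data,
    columns=pd.MultiIndex.from_product(
        [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]),
)
info = pd.DataFrame(
    [[12, 100], [23, 200], [40, 40]],
    columns=pd.MultiIndex.from_product([['INFO'], ['AC', 'DEPTH']]),
)
df = pd.concat([info, genotype], axis=1)

result = filter_by_dp_gq(df)  # keeps only the row labelled 1
```

Passing a looser threshold, for example filter_by_dp_gq(df, max_gq=10), would also keep the row labelled 0, because Sample1 there has DP 180 and GQ 5.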
Upvotes: 3
Views: 10095
Reputation: 862481
I think you can use filter to select the Samples columns, xs to get the DP and GQ subsets, compare them with ge and lt, combine the results with and (&), get the rows with at least one True by any, find the first True by idxmax and select it by loc:
# sample data changed so that a row matches (the first 99 in GQ of Sample1 was changed to 3)
genotype_data = [
['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
['0/1', '200,20', 60, 3, 0.1, '0/1', '200,50', 250, 99, 0.4],
['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01]
]
genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)
info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)
df = pd.concat([info, genotype], axis=1)
print (df)
   INFO         Sample1                         Sample2
    AC  DEPTH       GT      AD   DP  GQ    AB       GT      AD   DP  GQ    AB
0    12    100      0/1  120,60  180   5  0.50      0/1   200,2  202  99  0.01
1    23    200      0/1  200,20   60   3  0.10      0/1  200,50  250  99  0.40
2    40     40      0/1   200,2  202  99  0.01      0/1   200,2  202  99  0.01
df1 = df.filter(like='Sample')
# axis must be passed by keyword: any(1) no longer works in pandas 2.0+
df = df.loc[[(df1.xs('DP', axis=1, level=1).ge(50) &
              df1.xs('GQ', axis=1, level=1).lt(4)).any(axis=1).idxmax()]]
print (df)
   INFO         Sample1                       Sample2
    AC  DEPTH       GT      AD  DP  GQ   AB       GT      AD   DP  GQ   AB
1    23    200      0/1  200,20  60   3  0.1      0/1  200,50  250  99  0.4
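It may help to unpack the one-liner into named steps, and to note one caveat: idxmax on a boolean Series that is all False still returns the first label, so a row is selected even when nothing matches. The sketch below uses the question's original data (where no sample has GQ < 4) to show this. The variable names and the any() guard are illustrative additions, not part of the answer above, and xs is applied to the whole frame because INFO has no DP or GQ sub-column.

```python
import pandas as pd

# The question's original data: no sample has GQ < 4
genotype_data = [
    ['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
    ['0/1', '200,20', 60, 99, 0.1, '0/1', '200,50', 250, 99, 0.4],
    ['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01],
]
genotype = pd.DataFrame(
    genotype_data,
    columns=pd.MultiIndex.from_product(
        [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]),
)
info = pd.DataFrame(
    [[12, 100], [23, 200], [40, 40]],
    columns=pd.MultiIndex.from_product([['INFO'], ['AC', 'DEPTH']]),
)
df = pd.concat([info, genotype], axis=1)

dp_ok = df.xs('DP', axis=1, level=1).ge(50)  # DP >= 50, one column per sample
gq_ok = df.xs('GQ', axis=1, level=1).lt(4)   # GQ < 4, one column per sample
row_mask = (dp_ok & gq_ok).any(axis=1)       # at least one sample passes both

# All values are False here, yet idxmax still returns the first label (0)
first = row_mask.idxmax()

# Guard: only select a row when at least one row really matches
result = df.loc[[first]] if row_mask.any() else df.iloc[0:0]
```

With this data result is an empty frame, whereas df.loc[[row_mask.idxmax()]] on its own would return the row labelled 0.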
EDIT:
If you need to return all rows that match the condition, remove loc and idxmax:
df1 = df.filter(like='Sample')
# condition changed to lt(10) (GQ < 10)
df = df[(df1.xs('DP', axis=1, level=1).ge(50) & df1.xs('GQ', axis=1, level=1).lt(10)).any(axis=1)]
print (df)
   INFO         Sample1                        Sample2
    AC  DEPTH       GT      AD   DP  GQ   AB       GT      AD   DP  GQ    AB
0    12    100      0/1  120,60  180   5  0.5      0/1   200,2  202  99  0.01
1    23    200      0/1  200,20   60   3  0.1      0/1  200,50  250  99  0.40
Upvotes: 4
Reputation: 294218
stack the first level and use query to identify indices:
df.loc[df.stack(0).query('DP >= 50 & GQ < 4').unstack().index]
   INFO         Sample1                       Sample2
    AC  DEPTH       GT      AD  DP  GQ   AB       GT      AD   DP  GQ   AB
1    23    200      0/1  200,20  60   3  0.1      0/1  200,50  250  99  0.4
I used @jezrael's setup
genotype_data = [
['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
['0/1', '200,20', 60, 3, 0.1, '0/1', '200,50', 250, 99, 0.4],
['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01]
]
genotype_columns = [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]
cols = pd.MultiIndex.from_product(genotype_columns)
genotype = pd.DataFrame(data=genotype_data, columns=cols)
info_columns = [['INFO'], ['AC', 'DEPTH']]
cols = pd.MultiIndex.from_product(info_columns)
info = pd.DataFrame(data=[[12, 100], [23, 200], [40, 40]], columns=cols)
df = pd.concat([info, genotype], axis=1)
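The idea behind the stack approach is to reshape the data so that each (sample, row) pair becomes one line, which lets query see DP and GQ as plain columns. The sketch below builds the same kind of long table with pd.concat instead of stack. That is a swapped-in technique, chosen only because the details of stack's output differ between pandas versions. The names samples, long_df, hits and result are illustrative.

```python
import pandas as pd

# Same setup as above (GQ of Sample1 in the second row is 3)
genotype_data = [
    ['0/1', '120,60', 180, 5, 0.5, '0/1', '200,2', 202, 99, 0.01],
    ['0/1', '200,20', 60, 3, 0.1, '0/1', '200,50', 250, 99, 0.4],
    ['0/1', '200,2', 202, 99, 0.01, '0/1', '200,2', 202, 99, 0.01],
]
genotype = pd.DataFrame(
    genotype_data,
    columns=pd.MultiIndex.from_product(
        [['Sample1', 'Sample2'], ['GT', 'AD', 'DP', 'GQ', 'AB']]),
)
info = pd.DataFrame(
    [[12, 100], [23, 200], [40, 40]],
    columns=pd.MultiIndex.from_product([['INFO'], ['AC', 'DEPTH']]),
)
df = pd.concat([info, genotype], axis=1)

# Every top-level column group except INFO is treated as a sample
samples = [name for name in df.columns.get_level_values(0).unique()
           if name != 'INFO']

# Long table: one line per (sample, original row), columns GT, AD, DP, GQ, AB
long_df = pd.concat({name: df[name] for name in samples}, names=['sample'])

# The condition can now be written directly against the DP and GQ columns
hits = long_df.query('DP >= 50 and GQ < 4')

# Map the hits back to the original rows, keeping their original order
result = df[df.index.isin(hits.index.get_level_values(1))]
```

Selecting with isin keeps each original row once even when several samples in that row match.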
Upvotes: 1