Reputation: 969
I have a dataframe with sample_id and mutation columns. Each sample contains several mutations:
sample_id mutation
sample1 mutation_A
sample1 mutation_B
sample1 mutation_D
sample2 mutation_C
sample2 mutation_D
sample3 mutation_A
sample3 mutation_B
sample3 mutation_C
I want to select the rows where, say, mutation_C exists. However, I want all the rows for that sample, not only the matching row:
df.loc[(df['mutation'] == 'mutation_C')]
returns:
sample_id mutation
sample2 mutation_C
How do I get the rest of the mutation data for sample2, like this:
sample_id mutation
sample2 mutation_C
sample2 mutation_D
I have been trying to use groupby but can't figure out how to obtain all the results.
Upvotes: 0
Views: 1638
Reputation: 1197
Assuming you have other data, a neater idea would be to set the index the way you are after.
(I've added a dummy column with df['value'] = 1.)
>>> a = df.set_index(['mutation', 'sample_id'])
>>> a.sort_index()
                      value
mutation   sample_id
mutation_A sample1        1
           sample3        1
mutation_B sample1        1
           sample3        1
mutation_C sample2        1
           sample3        1
mutation_D sample1        1
           sample2        1
>>> a.loc['mutation_C']
value
sample_id
sample2 1
sample3 1
If you really need the sample_ids as a list then you could do:
>>> a.loc['mutation_C'].index.tolist()
['sample2', 'sample3']
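To get from that list back to the full rows the question asks for, the list can be passed to isin. This is only a minimal sketch: it assumes the DataFrame is rebuilt from the question's data with the dummy value column, and the names ids and result are introduced here for illustration.

```python
import pandas as pd

# Rebuild the question's data (assumed layout: one row per sample/mutation pair).
df = pd.DataFrame({
    "sample_id": ["sample1", "sample1", "sample1",
                  "sample2", "sample2",
                  "sample3", "sample3", "sample3"],
    "mutation": ["mutation_A", "mutation_B", "mutation_D",
                 "mutation_C", "mutation_D",
                 "mutation_A", "mutation_B", "mutation_C"],
})
df["value"] = 1  # dummy column, as in the answer

# Index by (mutation, sample_id) and sort so label lookups are well defined.
a = df.set_index(["mutation", "sample_id"]).sort_index()

# Sample ids that carry mutation_C.
ids = a.loc["mutation_C"].index.tolist()

# All rows of the original frame belonging to those samples.
result = df[df["sample_id"].isin(ids)]
print(result)
```

The lookup on the MultiIndex only finds which samples match; the final selection still runs against the original df, so every mutation of the matching samples is kept.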
Not what you asked but perhaps another useful view:
>>> df.pivot_table(values='value', index='sample_id', columns='mutation')
mutation mutation_A mutation_B mutation_C mutation_D
sample_id
sample1 1.0 1.0 NaN 1.0
sample2 NaN NaN 1.0 1.0
sample3 1.0 1.0 1.0 NaN
Upvotes: 0
Reputation: 862731
First filter to get the sample_ids that contain mutation_C, and then filter the original DataFrame again with isin:
a = df.loc[df['mutation'] == 'mutation_C', 'sample_id']
df = df[df['sample_id'].isin(a)]
print (a)
3 sample2
7 sample3
Name: sample_id, dtype: object
df = df[df['sample_id'].isin(a)]
print (df)
sample_id mutation
3 sample2 mutation_C
4 sample2 mutation_D
5 sample3 mutation_A
6 sample3 mutation_B
7 sample3 mutation_C
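Since the question mentions groupby, the same selection can probably also be written with it. The following is a sketch under the assumption that the DataFrame matches the question's data; by_filter, by_transform and mask are illustrative names, not part of the answer above.

```python
import pandas as pd

# Rebuild the question's data (assumed layout).
df = pd.DataFrame({
    "sample_id": ["sample1", "sample1", "sample1",
                  "sample2", "sample2",
                  "sample3", "sample3", "sample3"],
    "mutation": ["mutation_A", "mutation_B", "mutation_D",
                 "mutation_C", "mutation_D",
                 "mutation_A", "mutation_B", "mutation_C"],
})

# Option 1: keep every group (sample) in which at least one row is mutation_C.
by_filter = df.groupby("sample_id").filter(
    lambda g: (g["mutation"] == "mutation_C").any()
)

# Option 2: build a row-wise mask that is True for every row of a matching sample.
mask = df["mutation"].eq("mutation_C").groupby(df["sample_id"]).transform("any")
by_transform = df[mask]

print(by_filter)
```

Both options keep the original row order and index. The isin approach is usually the simplest; the transform variant avoids calling a Python function once per group, which may matter on large frames.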
Upvotes: 1