Reputation: 165
I have a data frame such as:
Cluster sequence_name
1 specie1
1 specie2
1 specie3
1 sequence1
1 sequence2
2 specie8
3 specie2
4 sequence1
4 sequence3
4 specie56
...
I would like to remove all the clusters that contain only one sequence. In this example I should get:
Cluster sequence_name
1 specie1
1 specie2
1 specie3
1 sequence1
1 sequence2
4 sequence1
4 sequence3
4 specie56
...
Thank you for your help.
Upvotes: 1
Views: 28
Reputation: 38415
GroupBy.filter works well here. It keeps only the groups for which the function returns True:
df = df.groupby('Cluster').filter(lambda x: x.sequence_name.nunique() > 1)
Cluster sequence_name
0 1 specie1
1 1 specie2
2 1 specie3
3 1 sequence1
4 1 sequence2
7 4 sequence1
8 4 sequence3
9 4 specie56
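For anyone who wants to run this end to end, here is a minimal sketch. The data frame is rebuilt from the rows shown in the question (the rows hidden behind the `...` are left out), and the variable name `filtered` is just for illustration.

```python
import pandas as pd

# Sample data copied from the rows visible in the question.
df = pd.DataFrame({
    "Cluster": [1, 1, 1, 1, 1, 2, 3, 4, 4, 4],
    "sequence_name": [
        "specie1", "specie2", "specie3", "sequence1", "sequence2",
        "specie8", "specie2", "sequence1", "sequence3", "specie56",
    ],
})

# Keep only the clusters that hold more than one distinct sequence name.
filtered = df.groupby("Cluster").filter(
    lambda g: g["sequence_name"].nunique() > 1
)
print(filtered)
```

The original index labels are preserved, so clusters 2 and 3 (rows 5 and 6) simply disappear from the result.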
Upvotes: 1
Reputation: 14103
Boolean indexing with groupby and transform:
df[df.groupby('Cluster')['sequence_name'].transform('size') > 1]
Cluster sequence_name
0 1 specie1
1 1 specie2
2 1 specie3
3 1 sequence1
4 1 sequence2
7 4 sequence1
8 4 sequence3
9 4 specie56
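One thing worth checking is whether "only one sequence" means one row or one distinct name. `transform('size')` counts rows, while `nunique` counts distinct values, so the two can disagree when a cluster repeats the same name. A small sketch of the difference, using a hypothetical frame with a duplicated row (not data from the question):

```python
import pandas as pd

# Hypothetical data: cluster 3 contains the same name twice.
df = pd.DataFrame({
    "Cluster": [1, 1, 2, 3, 3],
    "sequence_name": ["specie1", "sequence1", "specie8", "specie2", "specie2"],
})

by_cluster = df.groupby("Cluster")["sequence_name"]

# Row count per cluster: cluster 3 has two rows, so it is kept.
by_rows = df[by_cluster.transform("size") > 1]

# Distinct names per cluster: cluster 3 has one distinct name, so it is dropped.
by_unique = df[by_cluster.transform("nunique") > 1]

print(by_rows)
print(by_unique)
```

On the data in the question both conditions give the same result, since no cluster repeats a name in the rows shown.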
Upvotes: 1