Grendel
Grendel

Reputation: 783

Remove groups if column contain more than x number of value in a list

Hello I have a list of element such as :

list_element=['Elephant','Monkey','Cow','Human','Bird','Snail','Snake','Donkey','Baboon','Orang-Outan']

and a dataframe

name  value
G1    Gr.1:4282399-4282564(+):Elephant
G1    SEQAHAHHE
G1    Zr.2:4282387-428245(-):Monkey
G1    GrA.2:42845-428289(+):Monkey
G1    QYEH897EH.3
G1    GrA2S2_ED:42845-4282789(+):Cow
G1    UDDKDDH6
G1    YDDIJBDIB778
G2    Gr.1:423663-4282542(-):Elephant
G2    Gr7E:423609-4282552(+):Elephant
G2    UEHHEE88E8E.2
G2    AP_UUD1_CU_OK-lQGGQ
G2    GrEH:423663-4282542(+):Baboon
G2    Gr7JE:42356-428257(+):Snail
G2    AP_UUD1_CU_OK-lQ8900
G2    ASGSG_E553:423663-4282542(-):Human
G3    GrA98_OK:42845-42867(+):Bird
G3    AGGAGA5567

I keep the G1 because we have a total of element <= 3 (Monkey, Elephant and Cow)

I remove G2 because we have a total of element > 3 (Elephant, Human, Snail and Baboon)

I keep G3 because there is a total of element <= 3 (Bird)

As you can see we fin the same pattern in value for element that contain a '):'

and the expected output would be :

name  value
G1    Gr.1:4282399-4282564(+):Elephant
G1    SEQAHAHHE
G1    Zr.2:4282387-428245(-):Monkey
G1    GrA.2:42845-428289(+):Monkey
G1    QYEH897EH.3
G1    GrA2S2_ED:42845-4282789(+):Cow
G1    UDDKDDH6
G1    YDDIJBDIB778
G3    GrA98_OK:42845-42867(+):Bird
G3    AGGAGA5567

Thanks for your help

Upvotes: 1

Views: 56

Answers (2)

Andrej Kesely
Andrej Kesely

Reputation: 195418

df = df.groupby('name').filter(lambda x: len(set(x[x['value'].str.contains(':')]['value'].str.split(':').str[-1].values)) <= 3)
print(df)

Prints:

   name                             value
0    G1  Gr.1:4282399-4282564(+):Elephant
1    G1                         SEQAHAHHE
2    G1     Zr.2:4282387-428245(-):Monkey
3    G1      GrA.2:42845-428289(+):Monkey
4    G1                       QYEH897EH.3
5    G1    GrA2S2_ED:42845-4282789(+):Cow
6    G1                          UDDKDDH6
7    G1                      YDDIJBDIB778
16   G3      GrA98_OK:42845-42867(+):Bird
17   G3                        AGGAGA5567

Upvotes: 2

Quang Hoang
Quang Hoang

Reputation: 150735

You can use .str.extract to extract the elements, then groupby().nunique() to count the number of unique elements:

s = (df['value'].str.extract('({})'.format('|'.join(list_element)) )[0]
    .groupby(df['name'])
    .transform('nunique') )

df[s<=3]

Output:

   name                             value
0    G1  Gr.1:4282399-4282564(+):Elephant
1    G1                         SEQAHAHHE
2    G1     Zr.2:4282387-428245(-):Monkey
3    G1      GrA.2:42845-428289(+):Monkey
4    G1                       QYEH897EH.3
5    G1    GrA2S2_ED:42845-4282789(+):Cow
6    G1                          UDDKDDH6
7    G1                      YDDIJBDIB778
16   G3      GrA98_OK:42845-42867(+):Bird
17   G3                        AGGAGA5567

Upvotes: 2

Related Questions