Reputation: 783
Hello I have a list of element such as :
list_element=['Elephant','Monkey','Cow','Human','Bird','Snail','Snake','Donkey','Baboon','Orang-Outan']
and a dataframe
name value
G1 Gr.1:4282399-4282564(+):Elephant
G1 SEQAHAHHE
G1 Zr.2:4282387-428245(-):Monkey
G1 GrA.2:42845-428289(+):Monkey
G1 QYEH897EH.3
G1 GrA2S2_ED:42845-4282789(+):Cow
G1 UDDKDDH6
G1 YDDIJBDIB778
G2 Gr.1:423663-4282542(-):Elephant
G2 Gr7E:423609-4282552(+):Elephant
G2 UEHHEE88E8E.2
G2 AP_UUD1_CU_OK-lQGGQ
G2 GrEH:423663-4282542(+):Baboon
G2 Gr7JE:42356-428257(+):Snail
G2 AP_UUD1_CU_OK-lQ8900
G2 ASGSG_E553:423663-4282542(-):Human
G3 GrA98_OK:42845-42867(+):Bird
G3 AGGAGA5567
I keep the G1
because we have a total of element <= 3
(Monkey, Elephant and Cow)
I remove G2
because we have a total of element > 3
(Elephant, Human, Snail and Baboon)
I keep G3
because there is a total of element <= 3
(Bird)
As you can see we fin the same pattern in value for element that contain a '):'
and the expected output would be :
name value
G1 Gr.1:4282399-4282564(+):Elephant
G1 SEQAHAHHE
G1 Zr.2:4282387-428245(-):Monkey
G1 GrA.2:42845-428289(+):Monkey
G1 QYEH897EH.3
G1 GrA2S2_ED:42845-4282789(+):Cow
G1 UDDKDDH6
G1 YDDIJBDIB778
G3 GrA98_OK:42845-42867(+):Bird
G3 AGGAGA5567
Thanks for your help
Upvotes: 1
Views: 56
Reputation: 195418
df = df.groupby('name').filter(lambda x: len(set(x[x['value'].str.contains(':')]['value'].str.split(':').str[-1].values)) <= 3)
print(df)
Prints:
name value
0 G1 Gr.1:4282399-4282564(+):Elephant
1 G1 SEQAHAHHE
2 G1 Zr.2:4282387-428245(-):Monkey
3 G1 GrA.2:42845-428289(+):Monkey
4 G1 QYEH897EH.3
5 G1 GrA2S2_ED:42845-4282789(+):Cow
6 G1 UDDKDDH6
7 G1 YDDIJBDIB778
16 G3 GrA98_OK:42845-42867(+):Bird
17 G3 AGGAGA5567
Upvotes: 2
Reputation: 150735
You can use .str.extract
to extract the elements, then groupby().nunique()
to count the number of unique elements:
s = (df['value'].str.extract('({})'.format('|'.join(list_element)) )[0]
.groupby(df['name'])
.transform('nunique') )
df[s<=3]
Output:
name value
0 G1 Gr.1:4282399-4282564(+):Elephant
1 G1 SEQAHAHHE
2 G1 Zr.2:4282387-428245(-):Monkey
3 G1 GrA.2:42845-428289(+):Monkey
4 G1 QYEH897EH.3
5 G1 GrA2S2_ED:42845-4282789(+):Cow
6 G1 UDDKDDH6
7 G1 YDDIJBDIB778
16 G3 GrA98_OK:42845-42867(+):Bird
17 G3 AGGAGA5567
Upvotes: 2