Reputation: 3432
I have a dataframe such as :
Groups NAME LETTER
G1 Canis_lupus A
G1 Canis_lupus B
G1 Canis_lupus F
G1 Cattus_cattus C
G1 Cattus_cattus C
G2 Canis_lupus C
G2 Zebra_fish A
G2 Zebra_fish D
G2 Zebra-fish B
G2 Cattus_cattus D
G2 Cattus_cattus E
and the idea is that I would like within Groups
to keep only two duplicated NAME
where LETTER
is within the list1=['A','B','C']
and list2=['D','E','F']
When there is for instance duplicate having A
and B
, I keep the A
in the alphabet order
In the example I should then get :
Groups NAME LETTER
G1 Canis_lupus A
G1 Canis_lupus F
G1 Cattus_cattus C
G2 Canis_lupus C
G2 Zebra_fish A
G2 Zebra_fish D
G2 Cattus_cattus D
Here is tha dataframe
{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G1', 5: 'G2', 6: 'G2', 7: 'G2', 8: 'G2', 9: 'G2', 10: 'G2'}, 'NAME': {0: 'Canis_lupus', 1: 'Canis_lupus', 2: 'Canis_lupus', 3: 'Cattus_cattus', 4: 'Cattus_cattus', 5: 'Canis_lupus', 6: 'Zebra_fish', 7: 'Zebra_fish', 8: 'Zebra-fish', 9: 'Cattus_cattus', 10: 'Cattus_cattus'}, 'LETTER': {0: 'A', 1: 'B', 2: 'F', 3: 'C', 4: 'C', 5: 'C', 6: 'A', 7: 'D', 8: 'B', 9: 'D', 10: 'E'}}
Upvotes: 0
Views: 44
Reputation: 862581
Idea is first sorting by LETTER
if necessary and then filter by both lists with Series.isin
, remove duplicates by DataFrame.drop_duplicates
and last join together by concat
:
#sorting per groups
df = df.sort_values(['Groups','LETTER'])
#sorting by one column
#df = df.sort_values('LETTER')
list1=['A','B','C']
list2=['D','E','F']
df1 = df[df['LETTER'].isin(list1)].drop_duplicates(['Groups','NAME'])
df2 = df[df['LETTER'].isin(list2)].drop_duplicates(['Groups','NAME'])
df = pd.concat([df1, df2]).sort_index(ignore_index=True)
print (df)
Groups NAME LETTER
0 G1 Canis_lupus A
1 G1 Canis_lupus F
2 G1 Cattus_cattus C
3 G2 Canis_lupus C
4 G2 Zebra_fish A
5 G2 Zebra_fish D
6 G2 Cattus_cattus D
Idea with mapping values to new column with merge dictionaries, similar like another solution, only also removed rows if no match both list
s by DataFrame.dropna
, last remove helper column and sorting:
d = {**dict.fromkeys(list1, 'a'),
**dict.fromkeys(list2, 'b')}
df = (df.assign(new = df.LETTER.map(d))
.dropna(subset=['new'])
.drop_duplicates(subset=['Groups', 'NAME', 'new'])
.sort_index(ignore_index=True)
.drop('new', 1)
)
print (df)
Groups NAME LETTER
0 G1 Canis_lupus A
1 G1 Canis_lupus F
2 G1 Cattus_cattus C
3 G2 Canis_lupus C
4 G2 Zebra_fish A
5 G2 Zebra_fish D
6 G2 Cattus_cattus D
Upvotes: 1
Reputation: 14949
Create a list_id
column to identify which list a particular letter belongs to. Then just drop the duplicates using the subset
parameter.
condlist = [df.LETTER.isin(list1),
df.LETTER.isin(list2)]
choicelist = [
'list1',
'list2'
]
df['list_id'] = np.select(condlist, choicelist)
df = df.sort_values('LETTER').drop_duplicates(
subset=['Groups', 'NAME', 'list_id']).drop('list_id', 1).sort_values(['Groups', 'NAME'])
OUTPUT:
Groups NAME LETTER
0 G1 Canis_lupus A
2 G1 Canis_lupus F
3 G1 Cattus_cattus C
5 G2 Canis_lupus C
9 G2 Cattus_cattus D
6 G2 Zebra_fish A
7 G2 Zebra_fish D
Upvotes: 0