chippycentra
chippycentra

Reputation: 3432

Remove duplicate within column depending on 2 lists in python

I have a dataframe such as :

Groups NAME          LETTER
G1     Canis_lupus   A
G1     Canis_lupus   B
G1     Canis_lupus   F
G1     Cattus_cattus C
G1     Cattus_cattus C
G2     Canis_lupus   C
G2     Zebra_fish    A
G2     Zebra_fish    D
G2     Zebra-fish    B
G2     Cattus_cattus D
G2     Cattus_cattus E

and the idea is that I would like within Groups to keep only two duplicated NAME where LETTER is within the list1=['A','B','C'] and list2=['D','E','F']

When there is for instance duplicate having A and B, I keep the A in the alphabet order

In the example I should then get :

 Groups NAME          LETTER
G1     Canis_lupus   A
G1     Canis_lupus   F
G1     Cattus_cattus C
G2     Canis_lupus   C
G2     Zebra_fish    A
G2     Zebra_fish    D
G2     Cattus_cattus D

Here is tha dataframe

{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G1', 5: 'G2', 6: 'G2', 7: 'G2', 8: 'G2', 9: 'G2', 10: 'G2'}, 'NAME': {0: 'Canis_lupus', 1: 'Canis_lupus', 2: 'Canis_lupus', 3: 'Cattus_cattus', 4: 'Cattus_cattus', 5: 'Canis_lupus', 6: 'Zebra_fish', 7: 'Zebra_fish', 8: 'Zebra-fish', 9: 'Cattus_cattus', 10: 'Cattus_cattus'}, 'LETTER': {0: 'A', 1: 'B', 2: 'F', 3: 'C', 4: 'C', 5: 'C', 6: 'A', 7: 'D', 8: 'B', 9: 'D', 10: 'E'}}

Upvotes: 0

Views: 44

Answers (2)

jezrael
jezrael

Reputation: 862581

Idea is first sorting by LETTER if necessary and then filter by both lists with Series.isin, remove duplicates by DataFrame.drop_duplicates and last join together by concat:

#sorting per groups
df = df.sort_values(['Groups','LETTER'])

#sorting by one column
#df = df.sort_values('LETTER')


list1=['A','B','C'] 
list2=['D','E','F']

df1 = df[df['LETTER'].isin(list1)].drop_duplicates(['Groups','NAME'])
df2 = df[df['LETTER'].isin(list2)].drop_duplicates(['Groups','NAME'])


df = pd.concat([df1, df2]).sort_index(ignore_index=True)
print (df)
  Groups           NAME LETTER
0     G1    Canis_lupus      A
1     G1    Canis_lupus      F
2     G1  Cattus_cattus      C
3     G2    Canis_lupus      C
4     G2     Zebra_fish      A
5     G2     Zebra_fish      D
6     G2  Cattus_cattus      D

Idea with mapping values to new column with merge dictionaries, similar like another solution, only also removed rows if no match both lists by DataFrame.dropna, last remove helper column and sorting:

d = {**dict.fromkeys(list1, 'a'),
     **dict.fromkeys(list2, 'b')}


df = (df.assign(new = df.LETTER.map(d))
        .dropna(subset=['new'])
        .drop_duplicates(subset=['Groups', 'NAME', 'new'])
        .sort_index(ignore_index=True)
        .drop('new', 1)
        )

print (df)
  Groups           NAME LETTER
0     G1    Canis_lupus      A
1     G1    Canis_lupus      F
2     G1  Cattus_cattus      C
3     G2    Canis_lupus      C
4     G2     Zebra_fish      A
5     G2     Zebra_fish      D
6     G2  Cattus_cattus      D

Upvotes: 1

Nk03
Nk03

Reputation: 14949

Create a list_id column to identify which list a particular letter belongs to. Then just drop the duplicates using the subset parameter.

condlist = [df.LETTER.isin(list1),
            df.LETTER.isin(list2)]

choicelist = [
    'list1',
    'list2'
]

df['list_id'] = np.select(condlist, choicelist)
df = df.sort_values('LETTER').drop_duplicates(
    subset=['Groups', 'NAME', 'list_id']).drop('list_id', 1).sort_values(['Groups', 'NAME'])

OUTPUT:

  Groups           NAME LETTER
0     G1    Canis_lupus      A
2     G1    Canis_lupus      F
3     G1  Cattus_cattus      C
5     G2    Canis_lupus      C
9     G2  Cattus_cattus      D
6     G2     Zebra_fish      A
7     G2     Zebra_fish      D

Upvotes: 0

Related Questions