Removing duplicates on 1 field based on priority list from another field in pandas

Question

I've got a large set of data where I'm trying to remove duplicates based on 2 fields. Sample set:

WOE_ID  ISO Locationname    Language    Placetype   Parent_ID   ID  Username
2347578 US  Maine           ENG         State       23424977    1   sampleuser
2444322 US  Maine           ENG         Town        12588275    1   sampleuser
2444324 US  Maine           ENG         Town        12588852    1   sampleuser
2444326 US  Maine           ENG         POI         12589403    1   sampleuser
2444327 US  Maine           ENG         Town        12587582    1   sampleuser
2444325 US  Maine           ENG         Country     12589315    1   sampleuser
28744443US  Maine           ENG         Town        12590578    1   sampleuser
2444323 US  Maine           ENG         Town        2374968     1   sampleuser

As these are all duplicate values of ID (1) I want to keep only the entry the biggest Placetype (being country here, where country>state>town>POI). Is there a simple way to do this that I'm overlooking or do I have to write a loop that compares all the entries? Id rather not do that because there's over 3 million entries in the total database and I will probably have to run it several times.

Thanks in advance!

jezrael · Accepted Answer

I think you can use ordered Categorical, then sort DataFrame by column Placetype by sort_values and then groupby with aggregating first:

print (df)
     WOE_ID ISO Locationname Language Placetype  Parent_ID  ID    Username
0   2347578  US        Maine      ENG     State   23424977   1  sampleuser
1   2444322  US        Maine      ENG      Town   12588275   1  sampleuser
2   2444324  US        Maine      ENG      Town   12588852   1  sampleuser
3   2444326  US        Maine      ENG       POI   12589403   2  sampleuser
4   2444327  US        Maine      ENG      Town   12587582   2  sampleuser
5   2444325  US        Maine      ENG   Country   12589315   3  sampleuser
6  28744443  US        Maine      ENG      Town   12590578   3  sampleuser
7   2444323  US        Maine      ENG      Town    2374968   3  sampleuser


df.Placetype = pd.Categorical(df.Placetype, 
                              categories=['Country','State','Town','POI'], 
                              ordered=True)

df = df.sort_values('Placetype').groupby('ID', as_index=False).first()
print (df)
   ID   WOE_ID ISO Locationname Language Placetype  Parent_ID    Username
0   1  2347578  US        Maine      ENG     State   23424977  sampleuser
1   2  2444327  US        Maine      ENG      Town   12587582  sampleuser
2   3  2444325  US        Maine      ENG   Country   12589315  sampleuser

Removing duplicates on 1 field based on priority list from another field in pandas

Answers (1)

Related Questions