Reputation: 199
I've got a large set of data where I'm trying to remove duplicates based on 2 fields. Sample set:
WOE_ID ISO Locationname Language Placetype Parent_ID ID Username
2347578 US Maine ENG State 23424977 1 sampleuser
2444322 US Maine ENG Town 12588275 1 sampleuser
2444324 US Maine ENG Town 12588852 1 sampleuser
2444326 US Maine ENG POI 12589403 1 sampleuser
2444327 US Maine ENG Town 12587582 1 sampleuser
2444325 US Maine ENG Country 12589315 1 sampleuser
28744443US Maine ENG Town 12590578 1 sampleuser
2444323 US Maine ENG Town 2374968 1 sampleuser
As these are all duplicate values of ID (1) I want to keep only the entry the biggest Placetype (being country here, where country>state>town>POI). Is there a simple way to do this that I'm overlooking or do I have to write a loop that compares all the entries? Id rather not do that because there's over 3 million entries in the total database and I will probably have to run it several times.
Thanks in advance!
Upvotes: 4
Views: 1564
Reputation: 863226
I think you can use ordered Categorical
, then sort DataFrame
by column Placetype
by sort_values
and then groupby
with aggregating first
:
print (df)
WOE_ID ISO Locationname Language Placetype Parent_ID ID Username
0 2347578 US Maine ENG State 23424977 1 sampleuser
1 2444322 US Maine ENG Town 12588275 1 sampleuser
2 2444324 US Maine ENG Town 12588852 1 sampleuser
3 2444326 US Maine ENG POI 12589403 2 sampleuser
4 2444327 US Maine ENG Town 12587582 2 sampleuser
5 2444325 US Maine ENG Country 12589315 3 sampleuser
6 28744443 US Maine ENG Town 12590578 3 sampleuser
7 2444323 US Maine ENG Town 2374968 3 sampleuser
df.Placetype = pd.Categorical(df.Placetype,
categories=['Country','State','Town','POI'],
ordered=True)
df = df.sort_values('Placetype').groupby('ID', as_index=False).first()
print (df)
ID WOE_ID ISO Locationname Language Placetype Parent_ID Username
0 1 2347578 US Maine ENG State 23424977 sampleuser
1 2 2444327 US Maine ENG Town 12587582 sampleuser
2 3 2444325 US Maine ENG Country 12589315 sampleuser
Upvotes: 7