Reputation: 62
I have a Pandas DataFrame:
ID,col_1,col_2,col_3,PostCode,City,Country
22,Apart 342,The abc,xyz,X01909,London,UK
33,Apt 342,xyz,The abc,X01909,London,UK
41,Apt 342,The abc,xyz,X01909,London,UK
The records above are duplicates of each other, but the same value can appear in a different column from row to row. I need to delete these duplicates from a very large dataset.
I have tried df=df.drop_duplicates(keep='first')
but it doesn't account for misspelled data.
Another sample:
ID,col_1,col_2,col_3,PostCode,City,Country
1,Home 25,pearse strret,near spire,4892,Manchester,UK
2,Pearse,Home no 25,near spire,4892,Manchester,UK
3,Home 25,Pearse street,near spire,4892,Manchester,UK
Upvotes: 0
Views: 79
Reputation: 146
Pandas drop_duplicates can only perform exact comparisons. So if you compare "The abc, xyz" and "xyz, The abc", it will see them as non-duplicates. The same goes for capitalization ("the abc" and "The Abc") and misspellings.
The presence of misspellings makes it virtually impossible to solve this perfectly.
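For illustration, a minimal sketch using the question's first sample shows why: every row differs in at least one column, so an exact comparison drops nothing.
import pandas as pd

df = pd.DataFrame({
    "ID": [22, 33, 41],
    "col_1": ["Apart 342", "Apt 342", "Apt 342"],
    "col_2": ["The abc", "xyz", "The abc"],
    "col_3": ["xyz", "The abc", "xyz"],
    "PostCode": ["X01909"] * 3,
    "City": ["London"] * 3,
    "Country": ["UK"] * 3,
})

# Exact comparison: the swapped values sit in different columns,
# so no row is an exact duplicate of another.
print(df.drop_duplicates(keep="first"))  # all three rows survive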
If you want to use drop_duplicates, I would suggest first cleaning the data as much as you can: lowercase everything, collapse whitespace, and normalize abbreviations (e.g. "Apt" vs "Apart").
After this, you can try df.drop_duplicates(keep="first", subset=<list-of-columns-to-check-for-duplicates>).
Of course, it will not mark all duplicates (how much it catches depends on how well you cleaned/normalized the text), but it will help.
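A minimal sketch of that cleaning step, applied to the df above. The specific normalization rules (lowercasing, collapsing whitespace, treating "Apart" as "Apt", sorting the free-text columns within each row) are assumptions to adapt, not a complete recipe:
import numpy as np
import pandas as pd

cols = ["col_1", "col_2", "col_3"]

def normalize(text):
    # Lowercase and collapse whitespace; the "apart" -> "apt"
    # substitution is an assumed equivalence, not from the question.
    text = " ".join(str(text).lower().split())
    return text.replace("apart ", "apt ")

for c in cols:
    df[c] = df[c].map(normalize)

# Sort the free-text columns within each row so that swapped values
# ("the abc"/"xyz" vs "xyz"/"the abc") line up before comparing.
df[cols] = np.sort(df[cols].to_numpy(), axis=1)

df = df.drop_duplicates(subset=cols + ["PostCode", "City", "Country"], keep="first")
On the first sample this collapses all three rows into one, since after normalization and row-wise sorting they become identical.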
If the dataset is not awfully big, I would suggest looking into fuzzy string matching (https://pypi.org/project/fuzzywuzzy/) and comparing each pair of samples. You can compare each pair of (cleaned) rows and get a similarity score, then assume two places are duplicates if the score is close to 100.
Quick and dirty example:
from fuzzywuzzy import fuzz

# Without even cleaning the texts:
D = "Home 25,pearse strret,near spire"
E = "Pearse,Home no 25,near spire"
F = "Home 25,Pearse street,near spire"

print(fuzz.token_set_ratio(D, E), fuzz.token_set_ratio(D, F), fuzz.token_set_ratio(E, F))
# 94 97 94
# all similarity ratios are over 90: duplicates

A = "Apart 342,The abc,xyz"
print(fuzz.token_set_ratio(A, E))
# 24
# similarity score is low: not a duplicate
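To turn that into an actual dedup pass over the DataFrame, something like the following pairwise sketch could work. The 90 cut-off and joining the three columns with commas are assumptions; tune both for your data:
from fuzzywuzzy import fuzz

cols = ["col_1", "col_2", "col_3"]
# One comparable string per row, built from the free-text columns.
texts = df[cols].astype(str).apply(",".join, axis=1).tolist()

to_drop = set()
# O(n^2) pairwise pass: fine for small frames, too slow for very large
# ones (consider blocking, e.g. only compare rows sharing a PostCode).
for i in range(len(texts)):
    if i in to_drop:
        continue
    for j in range(i + 1, len(texts)):
        if j not in to_drop and fuzz.token_set_ratio(texts[i], texts[j]) > 90:
            to_drop.add(j)  # keep the earlier row, drop the later one

df = df.drop(df.index[list(to_drop)])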
Upvotes: 2