Joe121

Reputation: 62

Delete duplicate pandas rows

I have a Pandas dataframe

ID,col_1,col_2,col_3,PostCode,City,Country
22,Apart 342,The abc,xyz,X01909,London,UK
33,Apt 342,xyz,The abc,X01909,London,UK
41,Apt 342,The abc,xyz,X01909,London,UK

The records above are duplicates of each other, but a value that belongs in one column may appear in another column. I need to delete these duplicates from a very large dataset.

I have tried df=df.drop_duplicates(keep='first')

but it doesn't handle misspelled or swapped data.

More sample data:

ID,col_1,col_2,col_3,PostCode,City,Country
1,Home 25,pearse strret,near spire, 4892,Manchester,UK
2,Pearse,Home no 25,near spire,4892,Manchester,UK
3,Home 25,Pearse street,near spire,4892,Manchester,UK

Upvotes: 0

Views: 79

Answers (1)

beatrice-portelli

Reputation: 146

Pandas drop_duplicates can only perform exact comparisons. So if you compare "The abc, xyz" and "xyz, The abc" it will see them as non-duplicates. The same goes for capitalization ("the abc" vs. "The Abc") and misspellings.
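
For example, a quick sketch with your first sample (values copied from the question) shows that drop_duplicates keeps all three rows, because no two rows are exactly identical:

import pandas as pd

df = pd.DataFrame({
    "ID": [22, 33, 41],
    "col_1": ["Apart 342", "Apt 342", "Apt 342"],
    "col_2": ["The abc", "xyz", "The abc"],
    "col_3": ["xyz", "The abc", "xyz"],
    "PostCode": ["X01909", "X01909", "X01909"],
    "City": ["London", "London", "London"],
    "Country": ["UK", "UK", "UK"],
})

# no two rows are exactly equal, so nothing is dropped
print(len(df.drop_duplicates(keep="first")))  # 3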

The presence of misspellings makes it virtually impossible to solve this perfectly.

If you want to use drop_duplicates, I would suggest trying to clean the data as much as you can:

  • lowercase everything
  • create/get a dictionary of abbreviations for your field and expand them (e.g., replace "apt" with "apartment", "no" with "number", ...)
  • create a new column, col_4, that somehow combines the data of col_1, col_2 and col_3. This helps in case the information is the same, but switched between columns. (you could concatenate everything, split by word and sort alphabetically...)

After this, you can try and use df.drop_duplicates(keep="first", subset=<list-of-columns-to-check-for-duplicates>).
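
A rough sketch of that kind of cleaning (the abbreviation map and the normalization rules below are only assumptions, adapt them to your data; df is the dataframe from above):

import re

# hypothetical abbreviation map, extend it for your own data
ABBREVIATIONS = {"apt": "apartment", "apart": "apartment", "no": "number"}

def normalize(text):
    # lowercase, split on punctuation/whitespace, expand abbreviations
    words = re.split(r"\W+", str(text).lower())
    return " ".join(ABBREVIATIONS.get(w, w) for w in words if w)

def combine_address(row):
    # merge col_1..col_3 and sort the words, so swapped columns compare equal
    merged = " ".join(normalize(row[c]) for c in ["col_1", "col_2", "col_3"])
    return " ".join(sorted(merged.split()))

df["col_4"] = df.apply(combine_address, axis=1)
df = df.drop_duplicates(keep="first", subset=["col_4", "PostCode", "City", "Country"])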

Of course, it will not mark all duplicates (the results depend on how well you cleaned/normalized the text), but it will help.

If the dataset is not awfully big, I would suggest looking into fuzzy string matching (https://pypi.org/project/fuzzywuzzy/) and comparing each pair of samples. You can compare each pair of (cleaned) strings and get a similarity score; you can assume two places are duplicates if the score is close to 100.

Quick and dirty example

from fuzzywuzzy import fuzz

# without even cleaning the texts
D="Home 25,pearse strret,near spire"
E="Pearse,Home no 25,near spire"
F="Home 25,Pearse street,near spire"

print(fuzz.token_set_ratio(D,E), fuzz.token_set_ratio(D,F), fuzz.token_set_ratio(E,F))
# 94 97 94
# all similarity ratios are over 90: duplicates

A="Apart 342,The abc,xyz"
print(fuzz.token_set_ratio(A,E))
# 24
# similarity score is low: not a duplicate
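
If you want to apply the same idea to the whole dataframe, a sketch could look like this (pairwise comparison is O(n^2), so only for small datasets; the threshold of 90 is just an assumption you need to tune):

from itertools import combinations

# one string per row (or reuse the cleaned col_4 from above)
texts = df["col_1"].astype(str) + " " + df["col_2"].astype(str) + " " + df["col_3"].astype(str)

duplicate_idx = set()
for i, j in combinations(df.index, 2):
    # rows with a high token_set_ratio are treated as duplicates
    if fuzz.token_set_ratio(texts.loc[i], texts.loc[j]) >= 90:
        duplicate_idx.add(j)  # keep the first occurrence, drop the later one

df_clean = df.drop(index=duplicate_idx)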

Upvotes: 2
