Reputation: 17676
How can I use fuzzy matching in pandas to detect duplicate rows (efficiently)?
How can I find near-duplicate rows, comparing each row against all the others across its columns, without a gigantic for loop that converts row_i to a string and then compares it to every other row?
Upvotes: 6
Views: 8738
Reputation: 401
pandas-dedupe is your friend here. You can try the following:
import pandas as pd
from pandas_dedupe import dedupe_dataframe

# Toy frame with a fuzzy duplicate: 'jon' vs. 'john' at the same bank and email
df = pd.DataFrame.from_dict({'bank': ['bankA', 'bankA', 'bankB', 'bankX'],
                             'email': ['email1', 'email1', 'email2', 'email3'],
                             'name': ['jon', 'john', 'mark', 'pluto']})

# Deduplicate on the three columns; the first run starts an interactive
# console labeling session so the model can learn what counts as a match
dd = dedupe_dataframe(df, ['bank', 'name', 'email'], sample_size=1)
If you also want to assign a canonical name to rows that refer to the same entity, set canonicalize=True.
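For instance, a minimal sketch reusing the df from above (canonicalize is a documented dedupe_dataframe parameter; keeping the same column list is an assumption about typical usage):

# Clusters get a canonicalized value for each field, in addition to the cluster id
dd = dedupe_dataframe(df, ['bank', 'name', 'email'], canonicalize=True)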
[I'm one of the pandas-dedupe contributors]
Upvotes: 2
Reputation: 3249
There is now a package that makes it easier to use the dedupe library with pandas: pandas-dedupe
(I am a developer of the original dedupe library, but not the pandas-dedupe package)
Upvotes: 0
Reputation: 3249
Not pandas-specific, but within the Python ecosystem the dedupe library would seem to do what you want. In particular, it lets you compare each column of a record separately and then combine that information into a single probability that two records match.
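A minimal sketch of that workflow, assuming the dedupe 2.x API, with records stored as a dict keyed by row id (the field names, sample data, and threshold here are illustrative, not from the question):

import dedupe

# Records keyed by a unique id; each value maps field name -> value
data = {
    0: {'bank': 'bankA', 'email': 'email1', 'name': 'jon'},
    1: {'bank': 'bankA', 'email': 'email1', 'name': 'john'},
    2: {'bank': 'bankB', 'email': 'email2', 'name': 'mark'},
}

# Declare how each column should be compared; dedupe learns per-field weights
variables = [
    {'field': 'bank', 'type': 'String'},
    {'field': 'email', 'type': 'String'},
    {'field': 'name', 'type': 'String'},
]

deduper = dedupe.Dedupe(variables)
deduper.prepare_training(data)

# Active learning: you label a handful of candidate pairs in the console
dedupe.console_label(deduper)
deduper.train()

# Cluster the records; each cluster comes with per-record confidence scores
for record_ids, scores in deduper.partition(data, threshold=0.5):
    print(record_ids, scores)

The key design point is that dedupe avoids the all-pairs loop the question worries about: it uses blocking to restrict comparisons to plausible candidate pairs, then scores only those.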
Upvotes: 6