Reputation: 15
I'm working on detecting duplicates in a list of around 5 million addresses, and was wondering whether there is a consensus on an efficient algorithm for this purpose. I've looked at the Dedupe library on GitHub (https://github.com/datamade/dedupe), but based on the documentation I'm not clear whether it would scale well to a dataset of this size.
As an aside, I'm just looking to define duplicates based on textual similarity; I've already done a lot of cleaning on the addresses. So far I've been using a crude pairwise comparison based on Levenshtein distance, but I was wondering whether there's anything more efficient for large datasets.
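For context, here is roughly what the crude approach looks like. This is a minimal sketch assuming the python-Levenshtein package; `naive_duplicates` and `max_distance` are just illustrative names, not my actual code:

```python
import Levenshtein  # pip install python-Levenshtein

def naive_duplicates(addresses, max_distance=3):
    """Return index pairs whose edit distance is at most max_distance.

    Every record is compared with every other one, so the work grows
    quadratically: roughly 1.25e13 comparisons for 5 million addresses.
    """
    pairs = []
    for i in range(len(addresses)):
        for j in range(i + 1, len(addresses)):
            if Levenshtein.distance(addresses[i], addresses[j]) <= max_distance:
                pairs.append((i, j))
    return pairs
```

The quadratic blow-up is exactly what makes me doubt this will finish in reasonable time at 5 million records.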
Thanks,
Upvotes: 1
Views: 644
Reputation: 3249
Dedupe should work fine for data of that size.
There has also been some excellent work by Michael Wick and Beka Steorts on methods with better computational complexity than dedupe.
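For reference, a dedupe run on an address-only dataset looks roughly like the sketch below. This assumes the dedupe 2.x Python API; the field definition syntax, the threshold, and the tiny `records` dict are illustrative (it would stand in for your full keyed dataset):

```python
import dedupe

# Your cleaned addresses, keyed by a record ID
records = {
    1: {'address': '123 main st'},
    2: {'address': '123 main street'},
    3: {'address': '45 oak ave'},
}

fields = [{'field': 'address', 'type': 'String'}]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(records)  # sample candidate pairs for labelling
dedupe.console_label(deduper)      # interactive active-learning session
deduper.train()

# partition() blocks the data first, so it avoids an all-pairs comparison
clusters = deduper.partition(records, threshold=0.5)
for record_ids, scores in clusters:
    print(record_ids, scores)
```

If holding everything in memory becomes a problem at 5 million records, the database-backed examples in the dedupe-examples repo are, as far as I know, the intended route for datasets of that size.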
Upvotes: 2