econ_enthusiast

Reputation: 15

Efficient Algorithm for Detecting Text Duplicates in Big Dataset

I'm working on detecting duplicates in a list of around 5 million addresses and was wondering whether there is consensus on an efficient algorithm for this purpose. I've looked at the Dedupe library on GitHub (https://github.com/datamade/dedupe), but based on the documentation I'm not clear that it would scale well to a dataset of this size.

As an aside, I'm just looking to define duplicates based on textual similarity; I've already done a lot of cleaning on the addresses. So far I've been using a crude pairwise comparison based on Levenshtein distance, but was wondering if there's anything more efficient for large datasets.
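To give a sense of scale, an all-pairs comparison of 5 million records is on the order of 10^13 distance computations, so my current thinking is to only compare records that share some cheap blocking key. Here is a rough sketch of that idea (the block_key rule is just a placeholder and would need tuning for real address data):

```python
from collections import defaultdict
from itertools import combinations

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def block_key(address: str) -> str:
    # Placeholder blocking rule: group records by the first token of the address.
    tokens = address.split()
    return tokens[0].lower() if tokens else ""

def find_duplicates(addresses, max_distance=3):
    """Return index pairs whose addresses are within max_distance edits,
    comparing only records that fall in the same block."""
    blocks = defaultdict(list)
    for idx, addr in enumerate(addresses):
        blocks[block_key(addr)].append(idx)
    pairs = []
    for members in blocks.values():
        for i, j in combinations(members, 2):
            if levenshtein(addresses[i], addresses[j]) <= max_distance:
                pairs.append((i, j))
    return pairs

if __name__ == "__main__":
    sample = [
        "12 main st springfield",
        "12 main st springfeild",   # typo variant, 2 edits away
        "99 oak ave shelbyville",
    ]
    print(find_duplicates(sample))  # -> [(0, 1)]
```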

Thanks,

Upvotes: 1

Views: 644

Answers (1)

fgregg

Reputation: 3249

Dedupe should work fine for data of that size.
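For a single cleaned address field, the basic workflow looks roughly like this (a sketch following the patterns in the dedupe examples; method names differ somewhat between the 1.x and 2.x APIs, and load_addresses is just a placeholder for however you read your data):

```python
import dedupe

# Records keyed by id, one string field per record, e.g.
# {0: {"address": "12 main st springfield"}, 1: {"address": ...}, ...}
data = load_addresses()  # placeholder for your own loading code

fields = [{"field": "address", "type": "String"}]
deduper = dedupe.Dedupe(fields)

deduper.prepare_training(data)  # sample candidate pairs for labeling
dedupe.console_label(deduper)   # label a handful of pairs as duplicate / distinct
deduper.train()

# partition() groups records into clusters of likely duplicates
clusters = deduper.partition(data, 0.5)
for record_ids, scores in clusters:
    print(record_ids, scores)
```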

There has also been some excellent work by Michael Wick and Beka Steorts on methods with better computational complexity than dedupe.

Upvotes: 2
