Reputation: 15
I'm working on detecting duplicates in a list of around 5 million addresses, and was wondering whether there is a consensus on an efficient algorithm for this purpose. I've looked at the Dedupe library on GitHub (https://github.com/datamade/dedupe), but based on the documentation I'm not clear whether it would scale well to a dataset of this size.
As an aside, I'm just looking to define duplicates based on textual similarity; I've already done a lot of cleaning on the addresses. So far I've been using a crude pairwise comparison based on Levenshtein distance, but I was wondering whether there's anything more efficient for large datasets.
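For context, here is roughly what the crude approach looks like. This is a minimal sketch assuming the python-Levenshtein package; `naive_duplicates` and `max_distance` are just illustrative names, not my actual code:

```python
import Levenshtein  # pip install python-Levenshtein

def naive_duplicates(addresses, max_distance=3):
    """Return index pairs whose edit distance is at most max_distance.

    Every record is compared with every other one, so the work grows
    quadratically: roughly 1.25e13 comparisons for 5 million addresses.
    """
    pairs = []
    for i in range(len(addresses)):
        for j in range(i + 1, len(addresses)):
            if Levenshtein.distance(addresses[i], addresses[j]) <= max_distance:
                pairs.append((i, j))
    return pairs
```

The quadratic blow-up is exactly what makes me doubt this will finish in reasonable time at 5 million records.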
Thanks,
Upvotes: 1
Views: 644
Reputation: 3249
Dedupe should work fine for data of that size.
There has also been some excellent work by Michael Wick and Beka Steorts on methods with better computational complexity than dedupe.
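For reference, a dedupe run on an address-only dataset looks roughly like the sketch below. This assumes the dedupe 2.x Python API; the field definition syntax, the threshold, and the tiny `records` dict are illustrative (it would stand in for your full keyed dataset):

```python
import dedupe

# Your cleaned addresses, keyed by a record ID
records = {
    1: {'address': '123 main st'},
    2: {'address': '123 main street'},
    3: {'address': '45 oak ave'},
}

fields = [{'field': 'address', 'type': 'String'}]

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(records)  # sample candidate pairs for labelling
dedupe.console_label(deduper)      # interactive active-learning session
deduper.train()

# partition() blocks the data first, so it avoids an all-pairs comparison
clusters = deduper.partition(records, threshold=0.5)
for record_ids, scores in clusters:
    print(record_ids, scores)
```

If holding everything in memory becomes a problem at 5 million records, the database-backed examples in the dedupe-examples repo are, as far as I know, the intended route for datasets of that size.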
Upvotes: 2