Reputation: 1
I am running the Dedupe package on a large dataset (4 million records / 5 fields) with the following setup:
Note: Everything is in memory on Spark and DBFS.
Summary of steps (sketched in code below):
Build the blocking indexes
pairs(data) - about 3.5 million pairs for 100K records
score(pairs) - works fine; tested with 2 million input records and scoring worked as expected
cluster(score(pairs)) - hangs with the warning below whenever I try to pass more than 60K records.
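For reference, the flow above corresponds roughly to this sketch of the dedupe 2.x API. The field names, the threshold, and load_records() are placeholders, not my actual Spark/DBFS code:

```python
import dedupe

fields = [
    {'field': 'name', 'type': 'String'},
    {'field': 'address', 'type': 'String'},
]

# Placeholder: load_records() stands in for however the records are loaded
# from DBFS; it should return {record_id: {'name': ..., 'address': ...}}.
data = load_records()

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)
dedupe.console_label(deduper)      # manual labelling
deduper.train()

pairs = deduper.pairs(data)        # candidate pairs from the blocking indexes
scores = deduper.score(pairs)      # fine even for ~2 million input records
clusters = deduper.cluster(scores, threshold=0.5)  # hangs once input exceeds ~60K records
for record_ids, confidences in clusters:
    print(record_ids, confidences)
```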
Kindly suggest any pointers or big-data examples that I can refer to. MySQL is currently not the primary plan.
Warning: "3730000 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0"
Upvotes: 0
Views: 729
Reputation: 1
We are now using the PostgreSQL approach. Refer to: https://github.com/dedupeio/dedupe-examples/tree/master/pgsql_big_dedupe_example
Version used: 2.0.13
With 18K total records on a 16-core, 64 GB RAM machine, it takes about 20 minutes to run, including manual labelling, without any memory crash.
First issue: version 2.0.14 throws an error due to a compatibility issue (discussed in other threads here).
2.0.14 was also noticeably slower.
If you are running with more than 10K records, the PostgreSQL approach will give better performance.
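Very roughly, the idea behind the linked example is: write block keys to a blocking_map table, generate candidate pairs with a SQL self-join, and stream them through score()/cluster() instead of holding everything in memory. The sketch below is not the repo's exact code; the table names, column names, field names, connection details and the 0.5 threshold are placeholders:

```python
import psycopg2
import psycopg2.extras
import dedupe

with open('dedupe_settings', 'rb') as f:
    deduper = dedupe.StaticDedupe(f)         # model trained and labelled earlier

conn = psycopg2.connect(dbname='dedupe_db')  # placeholder connection details

# 1. Write (block_key, record_id) rows into a blocking_map table.
with conn.cursor() as write_cur, conn.cursor('read_records') as read_cur:
    read_cur.execute("SELECT record_id, name, address FROM records")
    records = ((row[0], {'name': row[1], 'address': row[2]}) for row in read_cur)
    psycopg2.extras.execute_values(
        write_cur,
        "INSERT INTO blocking_map (block_key, record_id) VALUES %s",
        deduper.fingerprinter(records))
conn.commit()

# 2. Self-join blocking_map in SQL to stream candidate pairs, then score and
#    cluster them lazily instead of materialising everything in memory.
pair_cur = conn.cursor('read_pairs')
pair_cur.execute("""
    SELECT a.record_id, row_to_json(a), b.record_id, row_to_json(b)
    FROM blocking_map bm_a
    JOIN blocking_map bm_b ON bm_a.block_key = bm_b.block_key
                          AND bm_a.record_id < bm_b.record_id
    JOIN records a ON a.record_id = bm_a.record_id
    JOIN records b ON b.record_id = bm_b.record_id
""")

candidate_pairs = (((id_a, rec_a), (id_b, rec_b))
                   for id_a, rec_a, id_b, rec_b in pair_cur)

for record_ids, scores in deduper.cluster(deduper.score(candidate_pairs),
                                          threshold=0.5):
    print(record_ids, scores)

conn.close()
```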
Upvotes: 0