Reputation: 1
I am running the Dedupe package on a large dataset (4 million records / 5 fields) with the following setup:
Note: Everything is in memory on Spark and DBFS.
Summary of steps (sketched in code below):
Build the blocking indexes
pairs(data) - about 3.5 million pairs for 100K records
score(pairs) - works fine; tested with 2 million input records and scoring worked as expected
cluster(score(pairs)) - hangs with the warning below whenever I try to pass more than 60K records.
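For reference, the flow above corresponds roughly to this sketch of the dedupe 2.x API. The field names, the threshold, and load_records() are placeholders, not my actual Spark/DBFS code:

```python
import dedupe

fields = [
    {'field': 'name', 'type': 'String'},
    {'field': 'address', 'type': 'String'},
]

# Placeholder: load_records() stands in for however the records are loaded
# from DBFS; it should return {record_id: {'name': ..., 'address': ...}}.
data = load_records()

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data)
dedupe.console_label(deduper)      # manual labelling
deduper.train()

pairs = deduper.pairs(data)        # candidate pairs from the blocking indexes
scores = deduper.score(pairs)      # fine even for ~2 million input records
clusters = deduper.cluster(scores, threshold=0.5)  # hangs once input exceeds ~60K records
for record_ids, confidences in clusters:
    print(record_ids, confidences)
```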
Kindly suggest any pointers or big-data examples that I can refer to. MySQL is currently not the primary plan.
Warning: "3730000 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0 A component contained 89927 elements. Components larger than 30000 are re-filtered. The threshold for this filtering is 0.0"
Upvotes: 0
Views: 729
Reputation: 1
We are now using the PostgreSQL approach. Refer to: https://github.com/dedupeio/dedupe-examples/tree/master/pgsql_big_dedupe_example
Version used: 2.0.13
With 18K total records on a 16-core, 64 GB RAM machine, it takes about 20 minutes to run, including manual labelling, without any memory crash.
First issue: version 2.0.14 throws an error due to a compatibility issue (discussed in other threads here).
2.0.14 was also noticeably slower.
If you are running with more than 10K records, the PostgreSQL approach will give better performance.
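Very roughly, the idea behind the linked example is: write block keys to a blocking_map table, generate candidate pairs with a SQL self-join, and stream them through score()/cluster() instead of holding everything in memory. The sketch below is not the repo's exact code; the table names, column names, field names, connection details and the 0.5 threshold are placeholders:

```python
import psycopg2
import psycopg2.extras
import dedupe

with open('dedupe_settings', 'rb') as f:
    deduper = dedupe.StaticDedupe(f)         # model trained and labelled earlier

conn = psycopg2.connect(dbname='dedupe_db')  # placeholder connection details

# 1. Write (block_key, record_id) rows into a blocking_map table.
with conn.cursor() as write_cur, conn.cursor('read_records') as read_cur:
    read_cur.execute("SELECT record_id, name, address FROM records")
    records = ((row[0], {'name': row[1], 'address': row[2]}) for row in read_cur)
    psycopg2.extras.execute_values(
        write_cur,
        "INSERT INTO blocking_map (block_key, record_id) VALUES %s",
        deduper.fingerprinter(records))
conn.commit()

# 2. Self-join blocking_map in SQL to stream candidate pairs, then score and
#    cluster them lazily instead of materialising everything in memory.
pair_cur = conn.cursor('read_pairs')
pair_cur.execute("""
    SELECT a.record_id, row_to_json(a), b.record_id, row_to_json(b)
    FROM blocking_map bm_a
    JOIN blocking_map bm_b ON bm_a.block_key = bm_b.block_key
                          AND bm_a.record_id < bm_b.record_id
    JOIN records a ON a.record_id = bm_a.record_id
    JOIN records b ON b.record_id = bm_b.record_id
""")

candidate_pairs = (((id_a, rec_a), (id_b, rec_b))
                   for id_a, rec_a, id_b, rec_b in pair_cur)

for record_ids, scores in deduper.cluster(deduper.score(candidate_pairs),
                                          threshold=0.5):
    print(record_ids, scores)

conn.close()
```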
Upvotes: 0