Reputation: 518
I'm currently building a database that will hold millions, possibly billions, of records. The source files I'm using are usually around 30 GB each, and when you combine them there are duplicate records. I only have 64 GB of RAM, so it isn't possible to remove the duplicates by loading all the lines into memory. I've tried a unique index, but inserting gets really slow after a while. Is there any way to remove the duplicates efficiently?
Record example:
{
  "_id": {
    "$oid": "5fabbb10364524e054d629b4"
  },
  "hash": "599e7b7fb49c772d93b7fc96020d9a13",
  "cleartext": "starocean40"
}
Upvotes: 0
Views: 122
Reputation: 1306
You don't have to keep the whole dataset in memory to find duplicates; instead, you can store just a set of record hashes.
MD5, for instance, produces 128-bit (16-byte) hashes. Assuming 1,000,000 records, that is 1,000,000 × 16 bytes = 16 MB plus some overhead. Mind you, you would still need to compare the records whose hashes match - it's possible for two differing records to have the same hash.
So, when importing the files, you would compute a hash of each record and check it against the Python set of previously seen hashes.
If a matching hash is found, you would check the DB to confirm that a matching record actually exists.
If no matching hash is found, you can be 100% sure this record has not yet been imported, so you can import it and add its hash to the in-memory set of hashes.
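For concreteness, here is a minimal Python sketch of that import loop using pymongo. The connection string, database/collection names, and the choice of the "hash" and "cleartext" fields as the duplicate key are assumptions based on the record example above:

import hashlib
import json

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
collection = client["mydb"]["records"]             # placeholder db/collection names

seen_hashes = set()  # in-memory set of 16-byte MD5 digests

def record_key(doc):
    # Hash only the fields that define a duplicate, in a stable order.
    canonical = json.dumps(
        {"hash": doc["hash"], "cleartext": doc["cleartext"]},
        sort_keys=True,
    ).encode("utf-8")
    return hashlib.md5(canonical).digest()  # 16 bytes per record

def import_file(path):
    with open(path, "r", encoding="utf-8") as fh:
        for line in fh:
            doc = json.loads(line)
            doc.pop("_id", None)  # drop the extended-JSON _id; let MongoDB assign one
            key = record_key(doc)
            if key in seen_hashes:
                # Possible duplicate: confirm against the database, since two
                # different records could (rarely) share the same MD5 digest.
                if collection.find_one({"hash": doc["hash"],
                                        "cleartext": doc["cleartext"]}):
                    continue
            collection.insert_one(doc)
            seen_hashes.add(key)

Note that a Python set adds noticeable per-entry overhead on top of the 16-byte digests, so for billions of records you may want a more compact structure for the digests, but the approach is the same.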
Alternatively, you can use Mongo's hashed indexes for a similar effect.
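As a rough illustration (connection string and names are again assumptions), creating such an index with pymongo looks like this. A hashed index speeds up the equality lookup on the hash field, so the duplicate check before each insert stays cheap as the collection grows; hashed indexes can't be marked unique, though, so the lookup-before-insert step is still needed:

from pymongo import MongoClient, HASHED

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
collection = client["mydb"]["records"]             # placeholder db/collection names

# Hashed index on the "hash" field for fast equality lookups.
collection.create_index([("hash", HASHED)])

# Example equality lookup that can use the hashed index:
match = collection.find_one({"hash": "599e7b7fb49c772d93b7fc96020d9a13"})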
Upvotes: 1