Reputation: 169
I am having a hard time storing hundreds of millions of key/value pairs of 16/32bytes with a hash array on my SSD.
With Kyoto Cabinet: When it works fine, it inserts at 70000 record/s. Once it drops, it goes down to 10-500 records/s. With the default settings, the drop happens after around a million records. Looking at the documentation, that is the default number of buckets in the array, so it makes sense. I increased this number to 25 millions and indeed, it works fine until around 25 millions records. Problem is, as soon as I push the number of buckets to 30 millions or over, the insert rate is down to 10-500 records/s from the beginning. Kyoto Cabinet is not designed to increase the number of bucket after the database is created, so I cannot insert more than 25 millions records.
1/ Why would KC's insert rate get very low once the bucket number exceeds 25M ?
With Berkeley DB: The best speed I got is slightly lower than KC, closer to 50000 record/s, but still ok. With the default settings, just like KC, the speed drops suddenly after around a million records. I know BDB is designed to extend gradually its number of buckets. Regardless of that, It tried to increase the initial number, playing with HashNumElements and FillFactor, but any of these attempts made the situation worst. So I still cannot insert over 1-2 millions records with DBD. I tried activating non-synchronized transactions, tried different rates of checkpoints, increased caches. Nothing improves the drop down.
2/ What could cause BDB's insert rate to drop after 1-2 million inserts ?
Note: I'm working with java, and when the speed is dropping, the CPU usage lowers to 0-30% while at 100% when working at correct speed.
Note: Stopping the process and resuming the insertion changes nothing. So I don't think that is related to memory limits or garbage collection.
Thx.
Upvotes: 7
Views: 2278
Reputation: 169
Below is how I managed to store billions of records despite the writing limitations encountered with KC.
With much effort, I still haven't solved the problem for neither Kyoto Cabinet nor Berkeley DB. However I came up with an interesting workaround using Kyoto Cabinet.
I noticed I cannot write more than 25M records on one KC file, but read has no such limitation −it is always fast, regardless of the size of the database. The solution I found is to create a new KC file (a new database) for every 25M new records. That way the reading happens on many KC files and is still fast, and the writing happens only on the last created file and is fast as well. Only remaining problem was to allow update/deletion of the records on the previous files. For that, I copied the SSTables approach, which is :
Notes :
Upvotes: 3