Reputation: 83
I am wondering what the best data structure is that supports concurrent writes and is particularly suited for write-intensive applications. The data structure is supposed to be a hash map and would be disk-based. One idea I have in mind to increase the level of concurrency is to have a fixed number of hash buckets and, within each bucket, a tree-based data structure for storing key-value pairs. However, I am unsure about which tree-based data structure to use. Also, I need information pertaining to the data structure itself, not some library that offers a concurrent hash map, since I need to implement it myself and we can't use user-space libraries in kernel space. Is there any data structure that permits record-level locking, so that I won't have to lock down the whole data structure while a write is taking place? Any ideas or suggestions? C would be used as the language for implementation.
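Roughly, what I have in mind so far is something like the sketch below (pthread_rwlock_t is only a placeholder here; the kernel implementation would use its own locking primitives, and the node/bucket names are just for illustration):

#include <pthread.h>
#include <stdint.h>

#define NUM_BUCKETS 256

/* One node of the per-bucket tree; the tree type itself is what I'm unsure about. */
struct kv_node {
    struct kv_node *left, *right;
    uint64_t key;
    void *value;
};

/* Each bucket carries its own lock, so writers to different buckets don't block each other. */
struct bucket {
    pthread_rwlock_t lock;   /* placeholder; kernel code would use rwlock_t or a spinlock */
    struct kv_node *root;    /* tree of key-value pairs that hash to this bucket */
};

struct bucket table[NUM_BUCKETS];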
Upvotes: 1
Views: 444
Reputation: 18148
If the table is disk-based then you'll probably want to use a closed hashing scheme (i.e. open addressing), e.g. Hopscotch Hashing; this is because an open hashing scheme (separate chaining) usually requires linked data structures, which are more difficult to (de)serialize to/from disk.
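For instance, a closed-hashing table can be stored as a flat array of fixed-size slots, which maps directly onto a file; the layout below is only a sketch and the field sizes are illustrative:

#include <stdint.h>

/* One fixed-size slot of a closed-hashing (open-addressing) table.
 * Because every slot has the same size and contains no pointers,
 * slot i lives at byte offset i * sizeof(struct slot) in the file. */
struct slot {
    uint8_t  state;       /* EMPTY, OCCUPIED or DELETED */
    uint8_t  hop_info;    /* Hopscotch neighbourhood bitmap (H = 8 here) */
    uint64_t key;
    uint8_t  value[48];   /* inline, fixed-size value */
} __attribute__((packed));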
You can then increase concurrency by sharding the hash table and buffering its writes. For example, you can create 256 separate hash tables, each with its own file and its own lock.
struct Shard {
    Lock lock
    File file
}
Now you create a global array of 256 shards and a pool of worker threads. You'll need some way to evenly distribute the workload across the threads, for example a circular queue of threads that skips threads that are currently performing an expensive operation such as flushing a queue to disk.
struct WorkerThread {
    Queue[] queues = Queue[256]
}

Shard[] shards = Shard[256]
WorkerThread[] threads = WorkerThread[8]
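One way to implement the "skip threads that are busy flushing" rule is an atomic busy flag per thread plus a round-robin scan; this is only a C11 sketch, and all of the names are made up:

#include <stdatomic.h>

#define NUM_THREADS 8

struct worker {
    atomic_bool busy;   /* set while the thread is flushing a queue to disk */
    /* ... the per-thread queues from the pseudocode above ... */
};

struct worker workers[NUM_THREADS];
static atomic_uint next_worker;   /* shared round-robin cursor */

/* Pick the next worker that is not currently flushing; fall back to
 * plain round-robin if every worker happens to be busy. */
struct worker *pick_worker(void)
{
    for (int i = 0; i < NUM_THREADS; i++) {
        unsigned n = atomic_fetch_add(&next_worker, 1) % NUM_THREADS;
        if (!atomic_load(&workers[n].busy))
            return &workers[n];
    }
    return &workers[atomic_fetch_add(&next_worker, 1) % NUM_THREADS];
}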
void processInput(WorkerThread thread, Data data) {
    int modulus = hashValue(data) % 256
    Queue queue = thread -> queues[modulus]
    addToQueue(queue, data)
    if (sizeOf(queue) > 64) {
        Shard shard = shards[modulus]
        lock(shard -> lock)
        writeToDisk(shard -> file, queue)
        clearQueue(queue)
        unlock(shard -> lock)
    }
}
Each thread maintains thread-local buffers of input data, indexed on n % 256; when a buffer exceeds 64 items, the thread locks the appropriate shard and flushes the buffer to disk. If the number of threads is significantly smaller than the number of shards then you shouldn't run into very much lock contention, and if the thread-local buffers are large enough then you shouldn't run into very much disk contention.
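In C, with pthread mutexes standing in for whatever locking primitive the kernel provides, the flush path might look roughly like this; struct queue and the queue_*/hashValue helpers are hypothetical and only declared to keep the sketch short:

#include <pthread.h>
#include <stddef.h>

#define NUM_SHARDS   256
#define FLUSH_LIMIT  64

/* Hypothetical queue type and helpers for the thread-local buffers. */
struct queue;
void     queue_append(struct queue *q, const void *data, size_t len);
size_t   queue_len(const struct queue *q);
void     queue_write_and_reset(int fd, struct queue *q);
unsigned hashValue(const void *data, size_t len);

struct shard {
    pthread_mutex_t lock;   /* guards the shard's on-disk table */
    int fd;                 /* file descriptor of the shard's file */
};

struct worker {
    struct queue *queues[NUM_SHARDS];   /* one thread-local buffer per shard */
};

struct shard shards[NUM_SHARDS];

/* Buffer one record in the calling thread's local queue for the target shard;
 * once the buffer holds more than FLUSH_LIMIT entries, lock that shard and
 * flush the buffer into its file. */
void process_input(struct worker *w, const void *data, size_t len)
{
    unsigned idx = hashValue(data, len) % NUM_SHARDS;
    struct queue *q = w->queues[idx];

    queue_append(q, data, len);
    if (queue_len(q) > FLUSH_LIMIT) {
        struct shard *s = &shards[idx];
        pthread_mutex_lock(&s->lock);
        queue_write_and_reset(s->fd, q);
        pthread_mutex_unlock(&s->lock);
    }
}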
You'll need to keep the shards relatively small in order to reduce the cost of I/O, since you may be reading and writing the entirety of a shard's file during every writeToDisk operation. In order to increase the number of shards you can either globally re-shard (e.g. increase the number of shards from 256 to 1024), which might require a stop-the-world operation, or take a more complicated but likely more performant approach and perform two or more levels of sharding (so the first level uses n % 256 to index the data, and then, if a shard becomes too large, the next level uses n / 256 % 16 or something along those lines).
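A quick sketch of the two-level index computation, assuming 256 first-level shards and 16 sub-shards per overgrown shard:

#include <stdint.h>

#define L1_SHARDS 256   /* first-level shards */
#define L2_SHARDS 16    /* sub-shards created when a first-level shard grows too large */

/* Split a hash into a first-level shard index and a second-level sub-shard
 * index; the first-level placement is unaffected by the second level. */
void shard_index(uint64_t hash, unsigned *level1, unsigned *level2)
{
    *level1 = hash % L1_SHARDS;
    *level2 = (hash / L1_SHARDS) % L2_SHARDS;
}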
Upvotes: 1