Reputation: 43
In the Dynamo paper, the authors introduced 3 different partitioning strategies: (1) T random tokens per node and partition by token value; (2) T random tokens per node and equal-sized partitions; (3) Q/S tokens per node and equal-sized partitions.
It seems DynamoDB has evolved from strategy 1 to strategy 3. I have a few questions related to strategy 3:
Since partition ranges are fixed, they can be stored in separate files, meaning a partition can be relocated as a unit by simply transferring the file (avoiding random accesses needed to locate specific items). This simplifies the process of bootstrapping and recovery.
How is this managed at a low level? One node can have a few partitions assigned to it. Is each partition handled separately inside the storage engine? For example, does each partition have a separate set of (memtable + SSTables), and do they compact at their own pace? This seems to introduce complexity to the system and make it hard to debug if the compaction processes go wild.
It seems the partitioning granularity is fixed beforehand. Is there any way to partition further after the initial stage? For example, if a-c is one partition and later on prefix b becomes hot and a noisy neighbor to prefixes a and c, is there a way to isolate b on another node? How do we handle this situation in DynamoDB?
Does Cassandra use strategy 1 or strategy 3? From what I can tell from the num_tokens and initial_token settings in cassandra.yaml, I believe it's strategy 1. Am I wrong?
Upvotes: 1
Views: 418
Reputation: 27294
Trying to answer each question in turn:
One node can have a few partitions assigned to it.
Each node will have 1 or more token ranges assigned during bootstrapping. Depending on the partitioner, this is a numeric range: -2^63 to 2^63-1 for the Murmur3 partitioner, or 0 to 2^127-1 for the random partitioner.
Each token here can contain a partition (but might not), so while you are thinking of it as the node owning partitions, strictly speaking it owns token ranges.
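To make the "a node owns token ranges, and any given token may or may not contain a partition" idea concrete, here is a minimal sketch. This is not Cassandra code: the node names and token values are made up, and the ring is tiny, but the ownership rule (a node at token T owns the range from the previous token up to and including T, wrapping at the ends) is the same idea.

```python
from bisect import bisect_left

# Hypothetical ring: each node owns the token range
# (previous_token, its_token], wrapping around at the ends.
# Tokens here are made-up numbers, not real Murmur3 output.
RING = [
    (-6_000_000_000_000_000_000, "node-a"),
    (-1_000_000_000_000_000_000, "node-b"),
    ( 3_000_000_000_000_000_000, "node-c"),
    ( 8_000_000_000_000_000_000, "node-d"),
]

def owner_of(token: int) -> str:
    """Return the node whose token range covers this token."""
    tokens = [t for t, _ in RING]
    idx = bisect_left(tokens, token)
    if idx == len(RING):          # past the last token: wrap to the first node
        idx = 0
    return RING[idx][1]

# A partition key hashes to exactly one token; most tokens inside a
# node's range are never produced by any key, which is why a node
# owns *ranges*, not a fixed list of partitions.
print(owner_of(2_500_000_000_000_000_000))   # node-c
print(owner_of(9_000_000_000_000_000_000))   # wraps around -> node-a
```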
Is each partition handled separately inside the storage engine?
This question doesn't really follow: an SSTable can contain 1 or more partitions, and a partition can be contained in 1 or more SSTables - i.e. a partition can span SSTables.
For example, does each partition have a separate set of (memtable + SSTables), and do they compact at their own pace?
No. There is a memtable for the database table, and these are flushed to create the SSTables. The compaction of the multiple SSTables is determined by the compaction strategy setting, and each strategy has quite different behaviours and advantages/disadvantages depending on the usage scenario - one size does not fit all. Again, each SSTable can contain multiple partitions, and a partition can appear in more than 1 SSTable.
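As a rough illustration of that write path - one memtable per table, flushes producing SSTables that each hold many partitions, and compaction merging them - here is a toy sketch. The class and method names are invented for illustration and are not Cassandra's internals; "compaction" here is just a naive merge-everything pass.

```python
from collections import defaultdict

class ToyTable:
    """Toy model: one memtable per table; flushes create SSTables that
    can each contain many partitions; a partition can end up spread
    across several SSTables until compaction merges them."""

    def __init__(self):
        self.memtable = defaultdict(dict)   # partition_key -> rows
        self.sstables = []                  # list of immutable snapshots

    def write(self, partition_key, clustering_key, value):
        self.memtable[partition_key][clustering_key] = value

    def flush(self):
        # The whole memtable is flushed as one SSTable, regardless of
        # how many partitions it happens to contain.
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = defaultdict(dict)

    def compact(self):
        # Simplified "merge everything" compaction: newer rows win.
        merged = defaultdict(dict)
        for sstable in self.sstables:          # oldest first
            for pk, rows in sstable.items():
                merged[pk].update(rows)
        self.sstables = [dict(merged)]

t = ToyTable()
t.write("user:1", "row1", "a"); t.flush()
t.write("user:1", "row2", "b"); t.write("user:2", "row1", "c"); t.flush()
print(len(t.sstables))   # 2 SSTables, and partition "user:1" spans both
t.compact()
print(len(t.sstables))   # 1 SSTable after compaction
```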
This seems to introduce complexity to the system and make it hard to debug if the compaction processes go wild.
Compaction itself is not a trivial topic, but since the initial premise is not correct, that particular complexity is not introduced.
It seems the partitioning granularity is fixed beforehand. Is there any way to partition further after the initial stage?
Writing specifically about Cassandra: every time you add or remove a node, the token ranges that belong to each node can and will change. So it is not entirely 'static', but it is not easy to change or manipulate either.
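To see why the ranges "can and will change", here is a small sketch (made-up tokens and node names) of a node bootstrapping into a ring and taking over part of an existing node's range:

```python
def ranges(ring):
    """Map each node to the token range it owns: (previous token, its token]."""
    tokens = sorted(ring)
    return {ring[t]: (tokens[i - 1], t) for i, t in enumerate(tokens)}

# Three-node ring with made-up tokens.
ring = {-6: "node-a", 0: "node-b", 6: "node-c"}
print(ranges(ring))
# {'node-a': (6, -6), 'node-b': (-6, 0), 'node-c': (0, 6)}  (first range wraps)

# A new node bootstraps at token 3: it takes over the (0, 3] slice
# that previously belonged to node-c.
ring[3] = "node-d"
print(ranges(ring))
# {'node-a': (6, -6), 'node-b': (-6, 0), 'node-d': (0, 3), 'node-c': (3, 6)}
```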
For example, if a-c is one partition and later on prefix b becomes hot and a noisy neighbor to prefixes a and c, is there a way to isolate b on another node?
Again, specific to Cassandra: in theory yes - you calculate the hash value of the partition key and use initial_token values on a node to give it a very narrow range. In practice no - this is a data model design issue, because the data has been partitioned in a way which created a hot spot.
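For the "in theory" part, here is a sketch of the arithmetic involved. The hash function is a stand-in (Cassandra really uses MurmurHash3, not MD5) and the node names and token choices are invented, so treat the values purely as illustration:

```python
import hashlib

def toy_token(partition_key: str) -> int:
    """Stand-in for Cassandra's token function: map a partition key to a
    signed 64-bit integer. Real Cassandra uses MurmurHash3, not MD5;
    this only illustrates the arithmetic."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

hot_token = toy_token("b")   # the hot partition key's token
print(hot_token)

# "In theory": give one node a deliberately narrow range around that
# token by hand-picking initial_token values (all made up here):
#   node-x: initial_token = hot_token - 1
#   node-y: initial_token = hot_token    <- owns only (hot_token - 1, hot_token]
# In practice this is fragile, and the real fix is a partition key
# design that does not concentrate traffic on a single partition.
```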
Does Cassandra use strategy 1 or strategy 3? From what I can tell from the num_tokens and initial_token settings in cassandra.yaml, I believe it's strategy 1. Am I wrong?
Using num_tokens, which creates vnodes, in effect divides the consistent hash ring up more times: with 10 nodes and num_tokens = 16, the overall token range is divided into 160 slices, with each node owning 16 of them as its token ranges. The nodes will of course also hold replicas of other nodes' ranges, based on the replication factor and rack assignments. If you only had RF=1, each node would only store data for the range(s) it is assigned.
initial_token is the setting that controls the token value(s) when the node is bootstrapped - you can choose to calculate it and set it manually, or you can let it be calculated for you. Further changes to that setting after bootstrap will not have an impact.
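A quick sketch of that num_tokens arithmetic (node names and the purely random token placement are illustrative - Cassandra can also place tokens more deliberately, e.g. via token allocation for a keyspace):

```python
import random

RING_MIN, RING_MAX = -2**63, 2**63 - 1   # Murmur3 token range

def assign_vnode_tokens(nodes, num_tokens, seed=42):
    """Toy version of what num_tokens does: each node picks num_tokens
    tokens, so the ring is cut into len(nodes) * num_tokens slices."""
    rng = random.Random(seed)
    ownership = {}                        # token -> owning node
    for node in nodes:
        for _ in range(num_tokens):
            ownership[rng.randint(RING_MIN, RING_MAX)] = node
    return ownership

nodes = [f"node-{i}" for i in range(10)]
ring = assign_vnode_tokens(nodes, num_tokens=16)

print(len(ring))                                         # 160 token ranges in total
print(sum(1 for n in ring.values() if n == "node-0"))    # 16 owned by node-0
```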
Upvotes: 1