soumyadeep sarkar

Reputation: 572

What are virtual nodes? And how do they help during partitioning in Cassandra?

I know we can use Cassandra's virtual node facility to avoid the overhead of manually assigning a start token to each node in the cluster. Instead, we set num_tokens, whose default value is 256.
In what way do these virtual nodes make a difference in partitioning? Does Cassandra set/assign a token range (minimum and maximum token) for a particular node?

Upvotes: 17

Views: 10168

Answers (1)

Aaron

Reputation: 57748

What are virtual nodes?

Prior to Cassandra 1.2, each node was assigned to a specific token range. Now each node can support multiple, non-contiguous token ranges. Instead of a node being responsible for one large range of tokens, it is responsible for many smaller ranges. In this way, one physical node is essentially hosting many smaller "virtual" nodes.

In what way do these virtual nodes make a difference in partitioning?

Consider the image in this blog: Virtual nodes in Cassandra 1.2.

Having many smaller token ranges (nodes) on each physical node allows for a more even distribution of data. This becomes evident when you add a physical node to the cluster, in that rebalancing (manually reassigning token ranges) is no longer necessary. As the Virtual Node documentation states, the new node "assumes responsibility for an even portion of data from the other nodes in the cluster."
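To see why adding a node needs no manual rebalancing, here is a minimal Python sketch of a token ring with randomly placed vnode tokens. It is a simulation under stated assumptions, not Cassandra's code: the 0 to 2^64 token space, the node names, and the helpers `build_ring`, `add_node`, and `ownership` are all made up for illustration.

```python
import random

# Hypothetical token space for this sketch. Murmur3Partitioner really uses
# -2**63 .. 2**63 - 1, but a 0-based ring keeps the arithmetic simple.
RING_SIZE = 2**64


def build_ring(nodes, num_tokens, rng):
    """Give every node num_tokens random tokens; return sorted (token, node) pairs."""
    ring = [(rng.randrange(RING_SIZE), node)
            for node in nodes
            for _ in range(num_tokens)]
    return sorted(ring)


def add_node(ring, node, num_tokens, rng):
    """Return a new ring in which `node` has joined with its own random tokens."""
    new_tokens = [(rng.randrange(RING_SIZE), node) for _ in range(num_tokens)]
    return sorted(ring + new_tokens)


def ownership(ring):
    """Fraction of the token space owned by each node.

    A token owns the range from the previous token (exclusive) up to itself
    (inclusive); the first token also owns the wrap-around range.
    """
    widths = {}
    prev = ring[-1][0] - RING_SIZE  # previous token of the first entry, wrapped
    for token, node in ring:
        widths[node] = widths.get(node, 0) + (token - prev)
        prev = token
    return {node: width / RING_SIZE for node, width in widths.items()}


rng = random.Random(42)  # fixed seed so the run is repeatable
ring = build_ring(["A", "B", "C"], 256, rng)
print("before:", ownership(ring))
print("after: ", ownership(add_node(ring, "D", 256, rng)))
```

In this model the new node's tokens land all over the ring, so each of its small ranges is carved out of a range held by some existing node. Every existing node gives up a little, and nobody has to recompute tokens by hand.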

Does Cassandra set/assign a token range (maximum and minimum token) for a particular node?

Yes, Cassandra predetermines the size of each virtual node. However, you can control the number of virtual nodes assigned to each physical node. Assume that your physical nodes are all configured for the default of 256 virtual nodes. If you add a new machine with more resources than your current nodes, and you want that machine to handle more load, you could configure it to allow 384 virtual nodes instead. Likewise, a machine with fewer resources could be configured to support a smaller number of virtual nodes.
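As a rough worked example of the 256 versus 384 case, the sketch below computes the share of the ring each node would be expected to own. It assumes tokens are placed uniformly at random, so that expected ownership is proportional to the token count; the node names and the `expected_shares` helper are hypothetical.

```python
def expected_shares(num_tokens_by_node):
    """Expected fraction of the ring per node.

    Assumes tokens are placed uniformly at random, so a node's expected
    ownership is proportional to how many tokens (vnodes) it holds.
    """
    total = sum(num_tokens_by_node.values())
    return {node: count / total for node, count in num_tokens_by_node.items()}


# Three nodes at the default of 256, plus one larger machine at 384.
cluster = {"node1": 256, "node2": 256, "node3": 256, "bigger_node": 384}
for node, share in expected_shares(cluster).items():
    print(f"{node}: {share:.1%}")
```

With these numbers the total is 1152 tokens, so each default node expects about 22.2% of the data and the larger machine about 33.3%.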

Edit 20230628

I do not understand the relationship between vnode and partitioner (let's take murmur3).

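A VNode's token range is a slice of the token space defined by the partitioner (for Murmur3, -2^63 to 2^63-1). The partitioner hashes each partition key to a token in that space, and the VNode whose range contains that token owns the partition.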

A partition key, once created, must land on some vnode?

Yes.
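A minimal sketch of that lookup in Python, under loud assumptions: `token_for` uses the first 8 bytes of an MD5 digest as a stand-in hash (it is not Cassandra's Murmur3 implementation), the ring is a tiny hand-made one, and `owner_of` is a hypothetical helper. It only shows the idea that every key hashes to a token and every token falls in exactly one vnode's range.

```python
import bisect
import hashlib

RING_SIZE = 2**64  # hypothetical token space for this sketch


def token_for(partition_key):
    """Stand-in hash: first 8 bytes of MD5, NOT Cassandra's Murmur3."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner_of(token, ring):
    """Return the node owning `token`.

    `ring` is a sorted list of (vnode_token, node). A vnode owns the range
    (previous_token, its_token], so the owner is the first vnode token that
    is >= the key's token, wrapping around to the start of the ring.
    """
    tokens = [t for t, _ in ring]
    i = bisect.bisect_left(tokens, token)
    return ring[i % len(ring)][1]


# Node A holds two non-contiguous ranges, i.e. two vnodes.
ring = [
    (2**62, "A"),
    (2**63, "B"),
    (3 * 2**62, "A"),
    (RING_SIZE - 1, "C"),
]

for key in ["user:1", "user:2", "user:3"]:
    t = token_for(key)
    print(key, "-> token", t, "-> node", owner_of(t, ring))
```

Because the ranges cover the whole token space, there is no token without an owner, which is why the answer to the question is simply "yes".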

How do we ensure this vnode will have enough space on disk?

We don't, but VNodes don't change that. As usual, it's up to the DBA and Dev teams to work together on appropriately sizing the anticipated compute resource usage up front. But with more, smaller ranges, generated tokens should be distributed more evenly.

What if too many partition keys land on the same vnode?

Then add another node to the cluster. The node add operation will bisect the current node's token ranges and reassign them to other nodes. This is no different than if we weren't using VNodes, although with VNodes there's a much lower chance of this becoming a problem.

Is the token creation algorithm different from that of partitioning?

Yes! The token partition algorithm is either Murmur3 (Murmur3Partitioner) or MD5 (RandomPartitioner). Creating Murmur3 tokens is faster than creating RandomPartitioner tokens, because the MD5 hash delivered with Java did a lot of other things that we just don't need.

Upvotes: 32
