mac
mac

Reputation: 647

How are nodes decided for replication in Cassandra

I am trying to understand how exactly data is replicated on multiple nodes in Cassandra. Lets assume we have 6 nodes and replication factor is 3. For all simplicity, lets assume single datacenter and single rack. Since RF is 3,data is stored in 3 replicas. I want to understand how the 3 replicas are decided.

Referring to example in http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2 (first image second part i.e, with virtual nodes), lets say our row falls under virtual node 'E' as decided by partitioner. So the row must be present in Node 1, 5, 6 according to distribution of virtual nodes among different nodes.

But coming to documentation - http://docs.datastax.com/en/cassandra/2.1/cassandra/architecture/architectureDataDistributeReplication_c.html , it says even in simple case of SimpleStrategy, first replica on a node is determined by the partitioner. Additional replicas are placed on the next nodes clockwise to the ring. So will data be stored in E, F, G virtual nodes or may be Node 1, 2, 3 ?

Which one is correct ? 1st link or documentation ?

Thanks!

Upvotes: 3

Views: 2395

Answers (2)

Marko Švaljek
Marko Švaljek

Reputation: 2101

And if it really interest you where your partition data ends up in the cluster you can use:

nodetool getendpoints

https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsGetEndPoints.html

Please take into account that documentation is simplified so that people understand it easier when seeing for the first time. In reality it's consistent hashing on steroids.

Previously every node had a single token and tokens were boundaries on ring that was used for consistent hashing. Basically you had a whole range divided into number of nodes that you had in the cluster. When you needed to do an operation on some partition, you took partition key, hashed it and then you knew to which node to go to. Basically after hashing you get a number in a range of -2^63 to 2^63 - 1. Then you go clockwise on the ring until you "find" a marker and this is how you know to which node a partition belongs initially. If you have greater replication factor, you just continue going clockwise on the ring until you "find" all the nodes that you need to satisfy replication factor. And this is how you know what nodes in the cluster have your partition.

With virtual nodes there is a property num_tokens and every node selects that many random tokens (In range previously mentioned) when joining the ring and they are then used for consistent hashing. Basically every node then sees that new node wants to have portions of the ring and streams the data to it. Also when new writes comes in they are sent to the new node that is going to own them (until the node fully joins the ring, it's responses are ignored when counted up for consistency levels).

This is how it was before (single token per node in cluster): Standard Consistent Hashing

This is how the ring looks like with virtual nodes: Consistent Hashing with vnodes

Absolutely the same rules apply with virtual nodes and ordinary consistent hashing, you go around the ring to select the replicas. If during your going around the ring you stumble upon the same node again you just skip it and continue until you find all the nodes that own the data according to the replication factor that you desire.

Upvotes: 2

JayK
JayK

Reputation: 815

Both are correct but I can understand the confusion. Let me explain:

In this case your row falls into a certain range. The partitioner knows that one node is primarily responsible for this range. It does not know the other nodes. However it can infer the other nodes based on the first node.

In this case, the first node is Five. It holds the token range E. Now let us ponder this statement.

Additional replicas are placed on the next nodes clockwise to the ring.

If you are using SimpleStrategy the next nodes are chosen clockwise from the first node. Which in this case is six and one. One is chosen because the token range wraps around from max to min.

Note that the nodes are clockwise arranged. Five,Six and lastly One. As the token range wraps around from max to min.

This is what the picture in the first link tries to explain by giving the 3 nodes the E token range. Some nodes are responsible for this token range because they inherit rows from an earlier node. They are responsible for certain ranges because they are next in line.

Upvotes: 1

Related Questions