Jingguo Yao
Jingguo Yao

Reputation: 7986

How the chances of getting "read-your-writes" consistency are increased in Dynamo?

In Section 5 of Dynamo paper, there is the following content:

In particular, since each write usually follows a read operation, the coordinator for a write is chosen to be the node that replied fastest to the previous read operation which is stored in the context information of the request. This optimization enables us to pick the node that has the data that was read by the preceding read operation thereby increasing the chances of getting "read-your-writes" consistency.

How the chances of getting "read-your-writes" consistency are increased?

"read-your-writes" means that a read following a write gets the value set by the write. The read and the write are performed by two different clients for this context. The reason is that the choice of the write coordinator does not impact on the chances of getting "read-your-writes" by the same client.

But the above text is talking about a write following a read. Here is my guess. The read coordinator will try to do syntactic reconciliation if it is possible. If syntactic reconciliation is impossible because of divergent versions, the client need to do semantic reconciliation before doing a write. Either way, the versions on all the nodes involved in the read operation is an ancestor of the reconciled version. So the following write can be sent to any of them to get applied. The earliest time for a write to be seen by a read is after the following steps are finished:

The shorter the time to perform the above steps, the more likely another following read sees the new version. Since it is very possible that the node which replied fastest to the previous read can perform the following steps in a shorter time. Such a node is chosen as the write coordinator.

Upvotes: 1

Views: 257

Answers (2)

Jingguo Yao
Jingguo Yao

Reputation: 7986

I re-read the Dynamo paper. I have a new understanding of "read-your-write" consistency. "read-your-writes" involves only one client. Image the following requests performed by one client on the same key:

  1. read-1
  2. write-1
  3. read-2

"read-your-writes" means that read-2 sees write-1. The write coordinator has the best chance to have write-1. To ensure "read-your-writes", it is desired that the write coordinator replies fastest to read-2. It is highly possible that the node replies fastest to read-1 also reply fastest to read-2. So choose the node replies fastest to read-1 as the write coordinator.

And what is the node that replied fastest to the previous read operation? Such a node only makes sense if client-driven coordination is used. For server-side coordination, the coordinator nodes replies to the client and the other involved nodes reply to the coordinator node. replied fastest is meaningless in this case.

Upvotes: 1

Pete - MSFT
Pete - MSFT

Reputation: 4309

  1. Section 2.3 talks about performing the reconciliation at read time rather than write time.
  2. Data versioning - "One can determine whether two versions of an object are on parallel branches or have a causal ordering, by examine their vector clocks."
  3. This paragraph from section 4. [emphasis mine]

In Dynamo, when a client wishes to update an object, it must specify which version it is updating. This is done by passing the context it obtained from an earlier read operation, which contains the vector clock information. Upon processing a read request, if Dynamo has access to multiple branches that cannot be syntactically reconciled, it will return all the objects at the leaves, with the corresponding version information in the context. An update using this context is considered to have reconciled the divergent versions and the branches are collapsed into a single new version

So by performing the read first, you're effectively reconciling all divergent versions prior to writing. By writing to that same node, the version you've updated is marked with the context and vector clock of the most up to date version and all divergent branches can be collapsed. This is sent to the top N nodes (as you've stated) as fast as possible. But by removing the divergent branches - you reduce the chance that multiple values could be returned. You only need one of the N nodes read in the next read to get the reconciled write. ie - the node as part of the quorum of R reads says - "I am the reconciled version, and all others must bow to me". (and if that has already been distributed to another of the "R" nodes, then there's even greater chance of getting the reconciled version in the quorum)

But, if you wrote to a different node, one that you hadn't read from - the vector clock that is being updated may not necessarily be a reconciled version of the object. Therefore, you could still have divergent branches. The following read will try and reconcile it, but it's more more probable that you could have multiple divergent data and no reconciliation.

If you've made it this far, I think the most interesting part is that per Section 6, client applications can dictate the values of N, R and W - ie - number of nodes that constitute the pool to draw from, and the number of nodes that must agree on a read or write for it to be successful.

Geez - my head hurts now.

Upvotes: 1

Related Questions