Evgeny

Reputation: 11

Riak GET http://host/riak/people/key returns "not found"

Riak is running in a cluster of 2 servers.

riak-admin status | grep ring_members
ring_members : ['[email protected]','[email protected]',

Write data on server1:

for i in {1..1000}; do
 curl -i -XPOST 'http://server1.local:8098/riak/people/'"$i" -H 'Content-Type: application/json' -d '{"name":"aaron_'"$i"'"}'
done

Shut down Riak on server1: /etc/init.d/riak stop

and read the data back from server2:

for i in {1..1000}; do
 curl -v -i http://server2.local:8098/riak/people/$i
done

On the first pass, 10-30% of the keys are not found; the data is returned on the second pass.

First pass

curl -i http://server2.local:8098/riak/people/196

* About to connect() to server2.local port 8098 (#0)
*   Trying 2.2.2.2... connected
* Connected to server2.local (2.2.2.2) port 8098 (#0)
> GET /riak/people/196 HTTP/1.1
> Host: server2.local:8098
> Accept: */*
< HTTP/1.1 404 Object Not Found
< Server: MochiWeb/1.1 WebMachine/1.10.0 (never breaks eye contact)
< Date: Thu, 27 Nov 2014 11:22:25 GMT
< Content-Type: text/plain
< Content-Length: 10

not found
* Connection #0 to host server2.local left intact
* Closing connection #0

Second pass

curl -i http://server2.local:8098/riak/people/196

* About to connect() to server2.local port 8098 (#0)
*   Trying 2.2.2.2... connected
* Connected to server2.local (2.2.2.2) port 8098 (#0)
> GET /riak/people/196 HTTP/1.1
> Host: server2.local:8098
> Accept: */*
< HTTP/1.1 200 OK
< X-Riak-Vclock: a85hYGBgzGDKBVIcypz/foYkbmfKYEpkzGNlCGh9fZYvCwA=
< Vary: Accept-Encoding
< Server: MochiWeb/1.1 WebMachine/1.10.0 (never breaks eye contact)
< Link: </riak/people>; rel="up" 
< Last-Modified: Thu, 27 Nov 2014 11:21:52 GMT
< ETag: "2C4oPFcSctzBX1mwHjjfQ1" 
< Date: Thu, 27 Nov 2014 11:25:47 GMT
< Content-Type: application/json
< Content-Length: 20

{"name":"aaron_196"}
* Closing connection #0

Why does this happen?

Upvotes: 0

Views: 223

Answers (1)

Joe

Reputation: 28366

This is caused by a combination of eventual consistency, partition tolerance, and small cluster size.

Preflists

When you store a key, Riak stores a copy in 3 different virtual nodes (vnodes), collectively called the "preflist" for that key.

The default ring size is 64 vnodes, so each of your nodes holds 32. When the cluster was created, the claim algorithm that assigns vnodes to physical nodes would have attempted to ensure that no more than 1 vnode in any preflist resides on the same physical node - an impossible task with only 2 nodes present.

So when you stored your values, 2 copies were written to one node, and one to the other.
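You can confirm the ownership split yourself; a quick check, assuming riak-admin is on the PATH (the exact command spelling varies slightly between Riak versions), is:

# Show how the ring is claimed; with 2 nodes each member should
# own about 50% of the 64 partitions (32 vnodes apiece).
riak-admin member-status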

Failure State

When a node fails, is shut down, or otherwise becomes unavailable, the remaining nodes start fallback vnodes as necessary in order to accept operations that would have been handled by the missing node's vnodes. When the original node becomes available again, a hinted handoff process is triggered that causes the fallback vnodes to send all of their data to the primary vnode in its original location.
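You can watch the handoff happen when the stopped node comes back, something like this (using the init script from your question; riak-admin transfers is the standard way to see pending handoffs):

# Bring node 1 back...
/etc/init.d/riak start
# ...then watch the fallback vnodes hand their data back to the
# primaries; repeat until no transfers remain.
riak-admin transfers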

Quorum

Requests in Riak are subject to quorums of vnodes. Some commonly used quorums are:

  • read r - The minimum number of vnodes that must reply to a read request before a value can be returned to the client.
  • write w - The minimum number of vnodes that must return success before a write can be reported successful to the client.
  • primary read pr - The same as r, except that fallback vnodes are not counted.
  • primary write pw - The same as w, except that fallback vnodes are not counted.

The default for both the r and w quorums is (n_val / 2) + 1 using integer division - 2 in the case of the default n_val=3.
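All of these can also be overridden per request as query parameters on the HTTP API. A quick sketch against your cluster (host and keys taken from your test) would be:

# Read waiting for only a single vnode reply.
curl -i 'http://server2.local:8098/riak/people/196?r=1'
# Write that is only acknowledged once all 3 replicas have it.
curl -i -XPOST 'http://server1.local:8098/riak/people/196?w=all' \
 -H 'Content-Type: application/json' -d '{"name":"aaron_196"}'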

What happened

When you wrote your values to the 2-node cluster, 2 copies of each value were written to one node and 1 to the other. Which node received 2 copies varied between preflists; I would expect roughly half of the keys were written twice to node 1 and once to node 2, and vice versa.

Then when you stopped node 1, all of its vnodes became simultaneously unavailable. Your first read request to node 2 would have caused it to start 1 or 2 fallback vnodes to replace the missing one(s) in the preflist. These fallbacks would not be able to respond until they finished starting up, so the first reply to the get process would have been from the primary vnode on node 2, which would have had the value. The get process would then wait for another reply (to satisfy the default r=2 quorum), which could have been a notfound from the newly started (and very empty) fallback vnode, after which it would return the value to the client.

The get process does not exit immediately after replying. It waits for responses from all of the vnodes, and if they don't match, it compares the returned values, selects the most current according to their vclocks, and sends the resolved value back out to the vnodes. That process is called read repair and is key to restoring consistency.

With each successive read, it becomes more likely that the fallbacks a request needs were already started by an earlier read.

Assuming you kept the default Bitcask backend, the fallback vnodes that didn't have the value would scan their in-memory keydir, discover the requested key was not present, and return notfound. A vnode that has the data would scan its keydir, find the entry for the requested key with a file reference and offset pointing into the data file on disk, perform the disk read to get the data, and then send that along to the get process. You can infer from this that a notfound response can be generated much faster than a response with data, simply because no disk access is involved.

Thus, if all of the following are true:

  • all vnodes required by your read request are running
  • the key has not been requested since the fallbacks were started
  • the key was written twice to node 1 and once to node 2
  • the default n_val=3 and r=2 are used

Then the get process would receive the 2 notfound responses first, which satisfies the r=2 quorum, and reply notfound to the client. It would, in due time, also receive the third response with the data and write the value out to the fallback vnodes.

The next read request would then find all 3 vnodes populated and return the value.
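Incidentally, you can check that it really is the early notfound replies winning the race: the get request also accepts a notfound_ok option (true by default), and setting it to false stops notfound replies from counting toward r, so the get waits for a copy that has the data. This assumes your Riak version exposes it as an HTTP query parameter, which the 1.x series does:

# notfound replies no longer satisfy r=2, so the reply comes from
# the vnode that actually holds the value, even on the first pass.
curl -i 'http://server2.local:8098/riak/people/196?notfound_ok=false'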

How to Prevent this

To prevent this from happening, do one of the following:

  • use a quorum that requires all vnodes in the preflist to respond (r=all)
  • use a quorum that requires an answer from a primary vnode (pr=1) - both are shown after this list
  • add nodes to the cluster until no preflist has 2 members on the same node.
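For the first two options, using the key from your trace:

# Wait for every vnode in the preflist before replying; the one copy
# that has the data is then merged in even if fallbacks say notfound.
curl -i 'http://server2.local:8098/riak/people/196?r=all'
# Or require at least one reply from a primary (non-fallback) vnode.
curl -i 'http://server2.local:8098/riak/people/196?pr=1'

If you don't want to pass the parameter on every request, the same values can be set once as bucket properties instead, e.g. a PUT of {"props":{"pr":1}} to /riak/people - though that affects all readers of the bucket.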

When joining nodes to the cluster, you should see a warning during the plan stage to the effect of "Not all replicas will be on distinct nodes" if any node will contain 2 members of a preflist.

The algorithm that assigns vnodes to nodes uses the target_n_val setting, which is usually higher than n_val, to ensure that if a node fails, the first choice for a fallback vnode is not a node that already has another member of the preflist. The default for target_n_val is 4, and it is usually recommended that there be target_n_val + 1 nodes in the cluster to ensure the vnode assignment can be accomplished without overlaps.
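In practice, growing the cluster uses the staged clustering commands. A sketch with a hypothetical third server, assuming the default riak@<hostname> node naming:

# On the new node (server3), stage a join to the existing cluster...
riak-admin cluster join riak@server1.local
# ...review the proposed ownership changes - this is where the
# "Not all replicas will be on distinct nodes" warning would show up...
riak-admin cluster plan
# ...and commit the plan once it looks right.
riak-admin cluster commit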

Upvotes: 1
