Reputation: 757
We have a cluster of 5 Riak nodes. We use riak-java-client to work with Riak, mostly to create, read, and delete data. Unfortunately, we have a problem reading old objects from Riak. Sometimes when we read an object by key we get null. When we repeat the read, we get the correct object for the same key. This looks very strange. What should we do, and how can we diagnose this problem? Any ideas, please? This is the code for reading an object from Riak:
String uuid = <keyInRiak>;
Namespace bucket = new Namespace("default", "default");
Location location = new Location(bucket, uuid);
FetchValue fetchValue = new FetchValue.Builder(location).build();
FetchValue.Response response = riakClient.execute(fetchValue);
if (!response.isNotFound()) {
    RiakObject riakObject = response.getValue(RiakObject.class);
    if (riakObject != null) {
        BinaryValue binaryValue = riakObject.getValue();
        byte[] result = binaryValue != null ? binaryValue.getValue() : null;
        // ... process result ...
    }
}
Upvotes: 0
Views: 141
Reputation: 524
From the sound of it, you may have AAE turned off and have force replaced a node at some point in the past without running a repair. Why do I say this?
By default, Riak has n_val=3, meaning that three copies of your data are stored in the cluster. When reading, to improve throughput, Riak will return only the first node response to the client, i.e. it asks 3 nodes and only sends the first reply back to the client.
Riak automatically repairs data on read using read-repair. When the three values are read, it compares them and, if one differs from the other two, overwrites it with the data from the other two to make all three consistent (yes, it's a bit more complex than this, involving vclocks, but we don't need to go into that much detail for this case). If you are running AAE, then AAE quietly reads every single piece of your data in the background, so these read repairs happen automatically and this issue should not occur.
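To make the mechanism above concrete, here is a toy, in-memory sketch of one key replicated on three nodes. It is not Riak's implementation and uses no Riak API: the class `ReplicaReadSketch` and its methods are hypothetical names invented for illustration, and vclocks are ignored entirely. It only models "reply with the first response, then repair the empty replicas".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Toy model of a single key stored on n_val = 3 replicas (hypothetical, not Riak code). */
class ReplicaReadSketch {
    // One slot per replica; null means that replica holds no copy of the value.
    private final List<String> replicas;

    ReplicaReadSketch(String first, String second, String third) {
        this.replicas = new ArrayList<>(Arrays.asList(first, second, third));
    }

    /**
     * Returns what the first responding replica holds (possibly null),
     * then runs a simplified read-repair so later reads see the value.
     */
    String read(int firstResponder) {
        String reply = replicas.get(firstResponder);
        readRepair();
        return reply;
    }

    /** Copies any surviving value onto replicas that are missing it. */
    private void readRepair() {
        String good = replicas.stream().filter(Objects::nonNull).findFirst().orElse(null);
        if (good == null) {
            return; // no replica has the value, nothing to repair from
        }
        for (int i = 0; i < replicas.size(); i++) {
            if (replicas.get(i) == null) {
                replicas.set(i, good);
            }
        }
    }

    /** Exposes the current content of one replica for inspection. */
    String valueOn(int replica) {
        return replicas.get(replica);
    }
}
```

In this model, a key whose first replica is empty yields null on the first read and the stored value on the second, which mirrors the behaviour described in the question.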
If AAE is turned off, then when a node fails (e.g. its hard disks grow old and die), the copy of the data on that node is lost. There is no clean way to replace the node, so you have to do a force replace. In the force replace procedure, the new node is allocated empty partitions with the correct namespaces corresponding to the old node. These empty partitions will not be populated until read-repair does so, i.e. the data needs to be read by a user, by AAE, or by a partition repair.
Assuming you didn't do a partition repair and AAE is turned off, after a force replace, if you try to read information that is stored on that node, it will return a NULL value: the key exists but the value is not yet populated. If you read it again, read-repair should have fixed the issue or, thanks to a node balancer, your read will land on a different node that does have the data. This is why old data reads correctly the second time. Any data created after the force replace would be fully populated on all nodes, which is why this issue only affects old data.
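Until the replicas are repaired, a client-side retry can paper over the symptom, since the second read usually succeeds. The following is a minimal sketch under that assumption. `RetryingReader` and `readWithRetry` are hypothetical names, not part of riak-java-client, and the helper deliberately takes a plain `Supplier` so it does not depend on any Riak class.

```java
import java.util.function.Supplier;

/** Hypothetical helper: repeats a read while it keeps returning null. */
class RetryingReader {

    /**
     * Calls fetch up to maxAttempts times, pausing between attempts.
     * Returns the first non-null result, or null if every attempt came back empty.
     */
    static <T> T readWithRetry(Supplier<T> fetch, int maxAttempts, long pauseMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            T value = fetch.get();
            if (value != null) {
                return value;
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(pauseMillis); // give read-repair a moment to finish
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return null;
                }
            }
        }
        return null;
    }
}
```

The supplier would wrap the fetch code from the question and return the `byte[]` result (or null), handling the client's checked exceptions itself. This only hides the problem from callers; a key that is genuinely missing still costs `maxAttempts` reads, so it is not a substitute for repairing the partitions.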
If you know which node(s) were force replaced in the past, you should probably try running an all-partition repair on them, as detailed here: https://www.tiot.jp/riak-docs/riak/kv/2.2.6/using/repair-recovery/repairs/#repairing-all-partitions-on-a-node. Please note that this will increase the load on your cluster whilst the repair is running. The other option would be to turn on AAE and wait for it to cover your entire cluster's data (usually a week or so, depending on load and volume of data).
Upvotes: 0