BMBM

Reputation: 16033

Riak performance - unexpected results

Over the last few days I played a bit with Riak. The initial setup was easier than I thought. Now I have a 3-node cluster, with all nodes running on the same VM for the sake of testing.

I admit the hardware settings of my virtual machine are very modest (1 CPU, 512 MB RAM), but I am still quite surprised by the slow performance of Riak.

Map Reduce

Playing a bit with MapReduce, I had around 2000 objects in one bucket, each about 1-2 KB in size as JSON. I used this map function:

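function(value, keyData, arg) {
    // Decode the stored object's JSON body and take the first decoded value.
    var data = Riak.mapValuesJson(value)[0];

    // Emit the object only if its displayname contains "max";
    // objects without a string displayname are skipped instead of throwing.
    if (data && typeof data.displayname === "string" &&
        data.displayname.indexOf("max") !== -1) return [data];
    return [];
}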

It took over 2 seconds just to perform the HTTP request and return its result, not counting the time my client code needed to deserialize the results from JSON. Removing 2 of the 3 nodes seemed to improve performance slightly, to just below 2 seconds, but this still seems really slow to me.

Is this to be expected? The objects were not that large in byte size, and 2000 objects in one bucket isn't that many either.
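
For anyone trying to reproduce this, here is a minimal sketch of the kind of job body that would be POSTed to Riak's /mapred HTTP endpoint for a full-bucket map. The bucket name "users" and the helper name buildFullBucketJob are made up for illustration; the actual request is not shown above.

```javascript
// Hypothetical helper: builds the JSON body for a full-bucket MapReduce job.
// Passing a bare bucket name as "inputs" asks Riak to feed every object
// in that bucket to the map phase.
function buildFullBucketJob(bucket, mapFn) {
  return {
    inputs: bucket,
    query: [
      // A single JavaScript map phase; keep: true returns its output to the client.
      { map: { language: "javascript", source: mapFn.toString(), keep: true } }
    ]
  };
}

// The map function from the question, written as a function expression.
const mapMax = function (value, keyData, arg) {
  var data = Riak.mapValuesJson(value)[0];
  if (data.displayname.indexOf("max") !== -1) return [data];
  return [];
};

// "users" is an assumed bucket name.
const job = buildFullBucketJob("users", mapMax);
const body = JSON.stringify(job); // send with Content-Type: application/json
```

Because "inputs" is just the bucket name, the cluster has to find every key in the bucket before the map phase can run at all.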

Insert

Batch inserting around 60,000 objects of the same size as above took rather long and actually didn't really work.

My script, which inserted the objects into Riak, died at around 40,000 objects and reported that it couldn't connect to the Riak node anymore. In the Riak logs I found an error message indicating that the node had run out of memory and died.

Question

This is really my first shot at Riak, so there is definitely a chance that I screwed something up.

It would really help me a lot if anyone with more experience with Riak could help me out with some of these questions.

Upvotes: 16

Views: 11734

Answers (3)

DNA

Reputation: 42617

I don't have direct experience with Riak, but I have worked a little with Cassandra, which is similar.

Firstly, performance will probably depend a lot on the number of cores available, and the memory. These systems are usually heavily pipelined and concurrent and benefit from a lot of cores. 4+ cores and 4GB+ of RAM would be a good starting point.

Secondly, MapReduce is designed for batch processing, not realtime queries.

Riak and all similar key-value stores are designed for high write performance and high read performance for simple lookups, not for complex querying.

Just for comparison, Cassandra on a single node (6 core, 6GB) can do 20,000 individual inserts per second.

Upvotes: 2

MightyE

Reputation: 2679

Now that some time has passed and several new versions of Riak have come out, my recommendation is this: never rely on full-bucket map/reduce. It is not an optimized operation, and chances are very good that there are other ways to structure your map/reduce so you don't have to look through so much data to pull out the few objects you need.

Secondary indexes, now available in newer versions of Riak, are definitely the way to go in this regard. Put an index on the objects you want to find (perhaps named 'ismax_int' with a value of 0 or 1). You can map/reduce a secondary index with hundreds of thousands of keys in microseconds, where a full bucket scan would have taken multiple seconds.
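
As a rough sketch of what that could look like over the HTTP API: the index is attached at write time with an x-riak-index-<name> header, and the map/reduce job takes the index as its input instead of the whole bucket. The helper names (indexHeaders, buildIndexJob) and the bucket name "users" are made up for illustration, and passing the integer key as a number rather than a string is an assumption. Note that secondary indexes need a backend that supports them (LevelDB, not Bitcask).

```javascript
// Hypothetical helper: HTTP headers for storing an object together with
// its 'ismax_int' secondary index (1 if displayname contains "max", else 0).
function indexHeaders(displayname) {
  const isMax = displayname.indexOf("max") !== -1 ? 1 : 0;
  return {
    "Content-Type": "application/json",
    "x-riak-index-ismax_int": String(isMax)
  };
}

// Hypothetical helper: a MapReduce job whose input is an index lookup,
// so only objects indexed with ismax_int = 1 reach the map phase.
function buildIndexJob(bucket) {
  return {
    inputs: { bucket: bucket, index: "ismax_int", key: 1 },
    query: [
      // Riak.mapValuesJson is a built-in map function that returns the decoded JSON values.
      { map: { language: "javascript", name: "Riak.mapValuesJson", keep: true } }
    ]
  };
}

const headers = indexHeaders("maximilian");
const indexJob = buildIndexJob("users"); // "users" is an assumed bucket name
```

The filtering that the original map function did per object now happens once, at write time, so the query no longer needs to touch objects that don't match.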

Upvotes: 4

Alexander Staubo

Reputation: 3373

This answer is a bit late, but I want to point out that Riak's mapreduce implementation is designed primarily to work with links, not entire buckets.

Riak's internal design is actually pretty much optimized against working with entire buckets. That's because buckets are not considered to be sequential tables but a keyspace distributed across a cluster of nodes. This means that random access is very fast (probably O(log n), but don't quote me on that), whereas serial access is very, very, very slow. Serial access, the way Riak is currently designed, necessarily means asking all nodes for their data.

Incidentally, "buckets" in Riak terminology are, confusingly and disappointingly, not implemented the way you probably think. What Riak calls a bucket is in reality just a namespace. Internally, there is only one bucket, and keys are stored with the bucket name as a prefix. This means that no matter how small or large your bucket is, enumerating the keys in a single bucket of size n will take time proportional to m, where m is the total number of keys in all buckets.
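
To make that concrete, here is a toy model of the idea in JavaScript. It is not Riak's actual storage code, just a single flat map with the bucket name as a key prefix; the class name FlatStore and the "/" separator are made up for illustration.

```javascript
// Toy model: one flat keyspace where "bucket" is only a key prefix.
class FlatStore {
  constructor() {
    this.entries = new Map();
    this.lastScanCount = 0; // keys examined by the most recent listKeys() call
  }

  put(bucket, key, value) {
    this.entries.set(bucket + "/" + key, value);
  }

  // Direct lookup: cost does not depend on how many keys exist overall.
  get(bucket, key) {
    return this.entries.get(bucket + "/" + key);
  }

  // Listing a bucket: every key in every bucket is examined.
  listKeys(bucket) {
    const prefix = bucket + "/";
    const keys = [];
    this.lastScanCount = 0;
    for (const fullKey of this.entries.keys()) {
      this.lastScanCount++;
      if (fullKey.startsWith(prefix)) keys.push(fullKey.slice(prefix.length));
    }
    return keys;
  }
}

const store = new FlatStore();
store.put("small", "a", 1);
store.put("small", "b", 2);
for (let i = 0; i < 1000; i++) store.put("big", "k" + i, i);

const smallKeys = store.listKeys("small");
// smallKeys has 2 entries, but store.lastScanCount is 1002:
// listing the tiny bucket still walked the whole keyspace.
```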

These limitations are implementation choices by Basho, not necessarily design flaws. Cassandra implements the exact same partitioning model as Riak, but supports efficient sequential range scans and mapreduce across large numbers of keys. Cassandra also implements true buckets.

Upvotes: 31
