Max

Reputation: 2859

Most suitable data store for billions of indexes

So we're looking to store two kinds of indexes.

  1. First kind will be in the order of billions, each with between 1 and 1000 values, each value being one or two 64 bit integers.
  2. Second kind will be in the order of millions, each with about 200 values, each value between 1KB and 1MB in size.
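
A quick back-of-envelope sizing of the two kinds may help ground the discussion. The per-index averages below are assumptions (the question only gives ranges), so treat this as an order-of-magnitude sketch:

```python
# Back-of-envelope sizing; the averages are assumptions, not numbers from the question.
BYTES_PER_VALUE_K1 = 16            # one or two 64-bit integers per value
AVG_VALUES_K1 = 100                # assumed average of the 1-1000 range
N_INDEXES_K1 = 1_000_000_000       # "order of billions"

AVG_VALUE_BYTES_K2 = 100_000       # assumed average of the 1KB-1MB range
AVG_VALUES_K2 = 200                # "about 200 values"
N_INDEXES_K2 = 1_000_000           # "order of millions"

kind1 = N_INDEXES_K1 * AVG_VALUES_K1 * BYTES_PER_VALUE_K1   # ~1.6 TB
kind2 = N_INDEXES_K2 * AVG_VALUES_K2 * AVG_VALUE_BYTES_K2   # ~20 TB
total_tb = (kind1 + kind2) / 1e12
print(f"~{total_tb:.1f} TB before replication and storage overhead")
```

Under these assumptions the raw data lands in the tens-of-terabytes range, which is worth knowing before comparing how each candidate store shards.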

And our usage pattern will be something like this:

Now, we've considered quite a few databases; our favourites at the moment are Cassandra and PostgreSQL. However, our application is in Erlang, which has no production-ready bindings for Cassandra. A major requirement is that it can't require too much manpower to maintain. I get the feeling that Cassandra is going to throw up unexpected scaling issues, whereas PostgreSQL is just going to be a pain to shard, but at least for us it's a known quantity. We're already familiar with PostgreSQL, but not hugely well acquainted with Cassandra.

So. Any suggestions or recommendations as to which data store would be most appropriate to our use case? I'm open to any and all suggestions!

Thanks,

-Alec

Upvotes: 0

Views: 210

Answers (2)

DNA

Reputation: 42617

You haven't given enough information to support much of an answer re: your index design. However, Cassandra scales up quite easily by growing the cluster.

You might want to read this article: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

A more significant issue for Cassandra is whether it supports the kind of queries you need - scalability won't be the problem. From the numbers you give, it sounds like we are talking about terabytes or tens of terabytes, which is very safe territory for Cassandra.

Upvotes: 2

AlfredoVR

Reputation: 4307

Billions is not a big number by today's standards, so why not write a benchmark instead of relying on guesswork? That will give you a better basis for a decision, and it's really easy to do. Just install your target OS and each database engine, then run queries with, say, Perl (because I like it). It won't take you more than a day to do all this; I've done something like it before. A nice way to benchmark is to write a script that executes queries randomly, or with something like a Gaussian bell curve, "simulating" real usage. Then plot the data, or do it like a boss and just read the logs.
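
A minimal sketch of that kind of benchmark harness, in Python rather than Perl, with an in-memory SQLite table standing in for whichever engine you're testing (the table layout, row count, and query mix are all assumptions for illustration):

```python
import random
import sqlite3
import time

def build_store(n_rows=10_000):
    """Stand-in for the engine under test: an in-memory SQLite table of 64-bit values."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE idx (key INTEGER PRIMARY KEY, val INTEGER)")
    db.executemany("INSERT INTO idx VALUES (?, ?)",
                   ((i, random.getrandbits(63)) for i in range(n_rows)))
    return db

def benchmark(db, n_queries=1_000, n_rows=10_000):
    """Issue lookups with a Gaussian key distribution to simulate skewed real usage."""
    start = time.perf_counter()
    hits = 0
    for _ in range(n_queries):
        # Cluster keys around the middle of the keyspace; % keeps them in range.
        key = int(random.gauss(n_rows / 2, n_rows / 10)) % n_rows
        if db.execute("SELECT val FROM idx WHERE key = ?", (key,)).fetchone():
            hits += 1
    elapsed = time.perf_counter() - start
    return hits, n_queries / elapsed  # hit count, queries per second

db = build_store()
hits, qps = benchmark(db)
print(f"{hits} hits, {qps:.0f} queries/sec")
```

To compare real candidates you would swap the SQLite connection for each engine's own client, keep the query generator identical, and plot the throughput numbers side by side.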

Upvotes: 2
