Jey Balachandran
Jey Balachandran

Reputation: 3935

Which is the right database for the job?

I am working on a feature and could use opinions on which database I should use to solve this problem.

We have a Rails application using MySQL. We have no issues with MySQL and it runs great. But for a new feature, we are deciding whether to stay MySQL or not. To simplify the problem, let's assume there is a User and Message model. A user can create messages. The message is delivered to other users based on their association with the poster.

Obviously there is an association based on friendship but there are many many more associations based on the user's profile. I plan to store some metadata about the poster along with the message. This way I don't have to pull the metadata each time when I query the messages.

Therefore, a message might look like this:

{
  id: 1,
  message: "Hi",
  created_at: 1234567890,
  metadata: {
    user_id: 555,
    category_1: null,
    category_2: null,
    category_3: null,
    ...
  }
}

When I query the messages, I need to be able to query based on zero or more metadata attributes. This call needs to be fast and occurs very often.

Due to the number of metadata attributes and the fact any number can be included in a query, creating SQL indexes here doesn't seem like a good idea.

Personally, I have experience with MySQL and MongoDB. I've started research on Cassandra, HBase, Riak and CouchDB. I could use some help from people who might have done the research as to which database is the right one for my task.

And yes, the messages table can easily grow into millions or rows.

Upvotes: 1

Views: 1042

Answers (6)

BenG
BenG

Reputation: 1766

Riak can query as fast as you make it, depends on the nodes

Mongo will let you create an index on any field, even if that is an array

CouchDB is very different, it builds indexes using a stored Map-Reduce(but without the reduce) they call a "view"

RethinkDB will let you have SQL but a little faster TokuDB will too

Redis will kill all in speed, but it's entirely stored in RAM

single level relations can be done in all of them, but differently for each.

Upvotes: 1

Bruno Rohée
Bruno Rohée

Reputation: 3534

Due to the number of metadata attributes and the fact any number can be included in a query, creating SQL indexes here doesn't seem like a good idea.

It sounds like you need a join, so you can mostly forget about CouchDB till they sort out the multiview code that was worked on (not actually sure it is still worked on).

Upvotes: 1

Theo
Theo

Reputation: 132862

This is a very open ended question, so all we can do is give advice based on experience. The first thing to consider is if it's a good idea to decide on using something you haven't used before, instead of using MySQL, which you are familiar with. It's boring not to use shiny new things when you have the opportunity, but believe me that it's terrible when you've painted yourself in a corner because you though that the new toy would do everything it said on the box. Nothing ever works the way it says in the blog posts.

I mostly have experience with MongoDB. It's a terrible choice unless you want to spend a lot of time trying different things and realizing they don't work. Once you scale up a bit you basically can't use things like secondary indexes, updates, and other things that make Mongo an otherwise awesomely nice tool (most of this has to do with its global write lock and the database format on disk, it basically sucks at concurrency and fragments really easily if you remove data).

I don't agree that HBase is out of the question, it doesn't have secondary indexes, but you can't use those anyway once you get above a certain traffic load. The same goes for Cassandra (which is easier to deploy and work with than HBase). Basically you will have to implement your own indexing which ever solution you choose.

What you should consider is things like if you need consistency over availability, or vice versa (e.g. how bad is it if a message is lost or delayed vs. how bad is it if a user can't post or read a message), or if you will do updates to your data (e.g. data in Riak is an opaque blob, to change it you need to read it and write it back, in Cassandra, HBase and MongoDB you can add and remove properties without first reading the object). Ease of use is also an important factor, and Mongo is certainly easy to use from the programmer's perspective, and HBase is horrible, but just spend some time making your own library that encapsulates the nasty stuff, it will be worth it.

Finally, don't listen to me, try them out and see how they perform and how it feels. Make sure you try to load it as hard as you can, and make sure you test everything you will do. I've made the mistake of not testing what happens when you remove lots of data in MongoDB, and have paid for that dearly.

Upvotes: 4

Elad
Elad

Reputation: 3130

I think you're spot-on in storing metadata along with each message! Sacrificing storage for faster retrieval time is probably the way to go. Note that it could get complicated if you ever need to change a user's metadata and propagate that to all the messages. You should consider how often that might happen, whether you'll actually need to update all the message records, and based on that whether it's worth paying the price for the sake of less queries (it probably is worth it, but that depends on the specifics of your system).

I agree with @Andrej_L that Hbase isn't the right solution for this problem. Cassandra falls in with it for the same reason.

CouchDB could solve your problem, but you're going to have to define views (materialized indices) for any metadata you're going to want to query. If the whole point of not using MySQL here is to avoid indexing everything, then Couch is probably not the right solution either.

Riak would be a much better option since it queries your data using map-reduce. That allows you to build any query you like without the need to pre-index all your data as in couch. Millions of rows are not a problem for Riak - no worries there. Should the need arise, it also scales very well by simply adding more nodes (and it can balance itself too, so this is really a non-issue).

So based on my own experience, I'd recommend Riak. However, unlike you, I've no direct experience with MongoDB so you'll have to judge it agains Riak yourself (or maybe someone else here can answer on that).

Upvotes: 3

yojimbo87
yojimbo87

Reputation: 68305

I would recommend to look at presentation about Why databases suck for messaging which is mainly targeted on the fact why you shouldn't use databases such as MySQL for messaging.

I think in this scenario CouchDB's changes feed may come quite handy although you probably would also have to create some more complex views based on querying message metadata. If speed is critical try to also look at redis which is really fast and comes with pub/sub functionality. MongoDB with it's ad hoc queries support may also be a decent solution for this use case.

Upvotes: 3

Andrej
Andrej

Reputation: 7504

From my experience with Hbase is not good solution for your application. Because:

  1. Doesn't contain secondary index by default(you should install plugins or something like these). So you can effectively search only by primary key. I have implemented secondary index using hbase and additional tables. So you can't use this one in online application because of for getting result you should run map/reduce job and it will take much time on million data.

  2. It's very difficult to support and adjust this db. For effective work you will use HBAse with Hadoop and it's necessary powerful computers or several ones.

  3. Hbase is very useful when you need make aggregation reports on big amount of data. It seems that you needn't.

Upvotes: 2

Related Questions