Postman Bob

Reputation: 11

Choosing a NoSQL database

I need a NoSQL database that will run on Windows Azure and that works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seem to be the most promising options.

Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.

Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.

Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values for a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase however, but I'm not sure what they would be in this case.

I also think crate.io looks like it could be interesting, but I wonder if there might be unforeseen problems.

Does anyone have ideas about the advantages and disadvantages of these databases in this case, and whether any of them are really unsuited for some reason?

Upvotes: 1

Views: 567

Answers (1)

Edu

Reputation: 2643

I currently work with Cassandra, so I can help with a few of its pros and cons.

Requirements

Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.

Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.

Consistency

In Cassandra you choose the consistency in each query you make, therefore you can have consistent data if you want to. Normally you use:

  • ONE - Only one replica has to acknowledge the read or write. This means fast reads/writes but low consistency: another node can still return the older value until the change has propagated.

  • QUORUM - A majority of the replicas for that data (more than half, for example 2 of 3 with a replication factor of 3) must acknowledge the read or write. Reads and writes are not as fast as with ONE, but you get FULL consistency IF you use it for BOTH reads and writes. That's because if more than half of the replicas have your data after an insert/update/delete, then a read from more than half of the replicas is guaranteed to include at least one replica with the most recent value, and that value is the one returned.

Both of these options are commonly recommended because they avoid single points of failure. If every replica had to respond (consistency level ALL), a single node being down or busy would be enough to make the query fail.
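The overlap argument can be checked with a little arithmetic. This is only a sketch in plain Python, not Cassandra driver code: the function names are made up for illustration, and a replication factor of 3 is an assumed example value.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond at QUORUM: a strict majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(rf: int, write_acks: int, read_acks: int) -> bool:
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_acks + write_acks > rf


def tolerated_failures(rf: int, required_acks: int) -> int:
    """How many replicas can be down while the query can still succeed."""
    return rf - required_acks


rf = 3  # assumed replication factor, for illustration only
q = quorum(rf)

# (label, replicas needed per write, replicas needed per read)
for label, w, r in [("ONE/ONE", 1, 1), ("QUORUM/QUORUM", q, q), ("ALL/ALL", rf, rf)]:
    print(label,
          "consistent reads:", reads_see_latest_write(rf, w, r),
          "replicas that may be down:", tolerated_failures(rf, max(w, r)))
```

With these numbers, QUORUM for both reads and writes is the only combination of the three that gives consistent reads while still tolerating a replica being down.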

Pros

Cassandra is a strong fit when you need performance, linear scalability and no single point of failure (some machines can be down and the others will take over the work). It also does most of its management work automatically: you don't need to manage the data distribution, replication, etc.

Cons

The downsides of Cassandra are in the modeling and queries.

With a relational database you model around the entities and the relationships between them. Normally you don't really care about what queries will be made and you work to normalize it.

With Cassandra the strategy is different: you model the tables to serve the queries. That is because you can't join, and you can't filter the data any way you want (only by its primary key). So if you have a database for a company with grocery stores and you want one query that returns all products of a certain store (e.g. New York City) and another query that returns all products of a certain department (e.g. Computers), you would have two tables, "ProductsByStore" and "ProductsByDepartment", holding the same data but organized differently to serve each query.
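To make the idea concrete, here is a rough sketch in plain Python that imitates the two tables with dictionaries keyed by what each query filters on. It does not talk to Cassandra; the helper names and the sample products (other than the New York City store and the Computers department from the example) are invented for illustration.

```python
from collections import defaultdict

# One "table" per query, each keyed by the value that query filters on.
products_by_store = defaultdict(list)
products_by_department = defaultdict(list)


def add_product(name: str, store: str, department: str) -> None:
    """Write the same product to both tables (the data is duplicated on purpose)."""
    row = {"name": name, "store": store, "department": department}
    products_by_store[store].append(dict(row))
    products_by_department[department].append(dict(row))


def names(rows) -> list:
    """Helper for display: sorted product names of the given rows."""
    return sorted(r["name"] for r in rows)


# Hypothetical sample data.
add_product("Laptop", "New York City", "Computers")
add_product("Keyboard", "Boston", "Computers")
add_product("Blender", "New York City", "Kitchen")

# Each query is a single lookup by key: no join, no filtering on other columns.
print(names(products_by_store["New York City"]))
print(names(products_by_department["Computers"]))
```

The cost is visible in add_product: every write has to go to each table that serves a query.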

Materialized views can help with this by removing the need to apply each change to multiple tables yourself, but the example is there to show how differently things work with Cassandra.

Denormalization is also common in Cassandra for the same reason: Performance.

Upvotes: 1
