Mehdi TAZI

Reputation: 575

Amount of data storage: HDFS vs NoSQL

Several sources on the internet explain that HDFS is built to handle greater amounts of data than NoSQL technologies (Cassandra, for example), and that in general, once we go beyond 1 TB, we should start thinking about Hadoop (HDFS) rather than NoSQL.

Besides the architecture, the fact that HDFS supports batch processing while most NoSQL technologies (e.g. Cassandra) perform random I/O, and the schema design differences, why can't NoSQL solutions (again, Cassandra for example) handle as much data as HDFS?

Why can't we use a NoSQL technology as a data lake? Why should we only use them as hot storage solutions in a big data architecture?

Upvotes: 3

Views: 349

Answers (1)

doanduyhai

Reputation: 8812

why can't NoSQL Solutions (... for example Cassandra) handle as much data as HDFS?

HDFS has been designed to store massive amounts of data and support batch-mode processing (OLAP), whereas Cassandra was designed for online transactional use cases (OLTP).

The current recommendation for server density is 1 TB/node with spinning disks and 3 TB/node with SSDs.

In the Cassandra 3.x series, the storage engine has been rewritten to improve node density. Furthermore, there are a few JIRA tickets aimed at improving server density in the future.

There is a limit on server density in Cassandra right now because of:

  • repair. With an eventually consistent DB, repair is mandatory to re-sync data after failures. The more data you have on one server, the longer repair takes (more precisely, the longer it takes to compute the Merkle tree, a binary tree of digests; see the sketch after this list). That said, the cost of repair has been mostly addressed by the incremental repair introduced in Cassandra 2.1

  • compaction. With an LSM-tree data structure, every mutation results in a new write on disk, so compaction is necessary to get rid of deprecated or deleted data (a toy merge is sketched after this list). The more data you have on one node, the longer compaction takes. There are also some solutions to address this issue, mainly the new DateTieredCompactionStrategy, which has tuning knobs to stop compacting data after a time threshold. A few people are using DateTiered compaction in production with densities up to 10TB/node

  • node rebuild. Imagine one node crashes and is completely lost: you'll need to rebuild it by streaming data from the other replicas. The higher the node density, the longer it takes to rebuild the node (a rough estimate follows the list)

  • load distribution. The more data you have on a node, the higher the load average (heavy disk I/O and high CPU usage). This greatly impacts node latency for real-time requests. Whereas a difference of 100ms is negligible for a batch job that takes 10h to complete, it is critical for a real-time database/application subject to a tight SLA
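
To make the repair cost concrete, here is a minimal Java sketch of a Merkle tree of digests. It is not Cassandra's actual implementation; the partition strings and the SHA-256 choice are just illustrative. The point it shows: every partition on the node has to be read and hashed before replicas can compare subtrees, which is why the validation phase scales with the amount of data per node.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of a Merkle tree over per-partition digests.
// The "partitions" list stands in for the rows a node would hash
// during a validation compaction.
public class MerkleTreeSketch {

    // Hash a single leaf (one partition's contents).
    static byte[] leafDigest(MessageDigest md, String partition) {
        return md.digest(partition.getBytes(StandardCharsets.UTF_8));
    }

    // Combine two child digests into a parent digest.
    static byte[] parentDigest(MessageDigest md, byte[] left, byte[] right) {
        md.update(left);
        md.update(right);
        return md.digest();
    }

    // Build the tree bottom-up and return the root digest. Every partition
    // must be read and hashed, which is why repair time grows with the
    // amount of data stored on the node.
    static byte[] rootDigest(List<String> partitions) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        List<byte[]> level = new ArrayList<>();
        for (String p : partitions) {
            level.add(leafDigest(md, p));
        }
        while (level.size() > 1) {
            List<byte[]> next = new ArrayList<>();
            for (int i = 0; i < level.size(); i += 2) {
                byte[] left = level.get(i);
                byte[] right = (i + 1 < level.size()) ? level.get(i + 1) : left;
                next.add(parentDigest(md, left, right));
            }
            level = next;
        }
        return level.get(0);
    }

    public static void main(String[] args) throws Exception {
        List<String> partitions = List.of("row-1", "row-2", "row-3", "row-4", "row-5");
        byte[] root = rootDigest(partitions);
        System.out.println("root digest length: " + root.length + " bytes");
    }
}
```

Replicas then compare these digests subtree by subtree and only stream the token ranges whose digests differ, so the streaming part can be small even when the hashing part is long.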
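
The compaction cost can be sketched the same way. The toy Java example below merges a few in-memory "SSTables" (sorted maps; the Cell record and sample keys are invented for illustration), keeping only the newest cell per key and purging tombstones. Real compaction does this over immutable on-disk files, so the work grows with the data volume on the node.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy sketch of LSM-style compaction: several immutable "SSTables"
// (here just sorted maps) are merged, the cell with the highest
// timestamp wins per key, and tombstoned keys are purged.
public class CompactionSketch {

    record Cell(long timestamp, String value, boolean tombstone) {}

    @SafeVarargs
    static TreeMap<String, Cell> compact(TreeMap<String, Cell>... sstables) {
        TreeMap<String, Cell> merged = new TreeMap<>();
        for (TreeMap<String, Cell> sstable : sstables) {
            for (Map.Entry<String, Cell> e : sstable.entrySet()) {
                // Keep the newer of the two cells for the same key.
                merged.merge(e.getKey(), e.getValue(),
                        (a, b) -> a.timestamp() >= b.timestamp() ? a : b);
            }
        }
        // Drop keys whose winning cell is a deletion marker.
        merged.values().removeIf(Cell::tombstone);
        return merged;
    }

    public static void main(String[] args) {
        TreeMap<String, Cell> older = new TreeMap<>(Map.of(
                "k1", new Cell(1, "v1", false),
                "k2", new Cell(1, "v2", false)));
        TreeMap<String, Cell> newer = new TreeMap<>(Map.of(
                "k1", new Cell(2, "v1-updated", false),
                "k2", new Cell(2, null, true)));   // k2 deleted later

        // Prints only k1 with its newest value; k2 is purged.
        System.out.println(compact(older, newer));
    }
}
```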
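
For the rebuild point, a back-of-the-envelope calculation shows how density translates into streaming time. The 200 MB/s sustained throughput used below is an assumed figure, not a measurement, and the estimate ignores the compaction work the replacement node must also perform once the data lands.

```java
// Rough estimate (assumed numbers, not a benchmark) of how long
// streaming a full replacement node takes as node density grows.
public class RebuildEstimate {
    public static void main(String[] args) {
        double streamMBps = 200.0; // assumed sustained streaming throughput
        for (double densityTB : new double[] {1, 3, 10}) {
            double hours = densityTB * 1024 * 1024 / streamMBps / 3600;
            System.out.printf("%.0f TB/node -> ~%.1f h just to stream the data%n",
                    densityTB, hours);
        }
    }
}
```

Even with these optimistic assumptions, a 10 TB node needs well over half a day just to receive its data, during which the cluster runs with reduced redundancy.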

Upvotes: 8
