MongoDB sharding for data warehouse

Question

Sharding provides scalable throughput and storage, which sounds like paradise for analytics. However, there is a significant trade-off that worries me.

If I use a hashed shard key:
- writes will be very scalable
- however, a sequential read of facts will be expensive, since it has to access every server

If I use a ranged shard key, e.g. on field A:
- writes might be scalable, as long as the shard key is not a timestamp field
- however, sequential reads will not be scalable if the query does not filter on field A
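To make the trade-off above concrete, here is a minimal pure-Python sketch of how the two strategies route keys. It is only a simulation: `hashed_shard`, `ranged_shard`, `RANGE_BOUNDARIES` and the four-shard layout are hypothetical names and values chosen for illustration, and the MD5-modulo routing is a stand-in, not MongoDB's actual hash function or chunk placement.

```python
import hashlib
from bisect import bisect_right

NUM_SHARDS = 4  # assumption: a small, fixed cluster

def hashed_shard(key: int, num_shards: int = NUM_SHARDS) -> int:
    """Route a key by hashing it (stand-in for a hashed shard key)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Hypothetical split points: shard 0 holds keys < 250, shard 1 holds 250..499,
# shard 2 holds 500..749, shard 3 holds everything from 750 upwards.
RANGE_BOUNDARIES = [250, 500, 750]

def ranged_shard(key: int) -> int:
    """Route a key by the range it falls into (stand-in for a ranged shard key)."""
    return bisect_right(RANGE_BOUNDARIES, key)

def shards_touched(route, keys):
    """Return the set of shards that hold at least one of the given keys."""
    return {route(k) for k in keys}

# A sequential read over a contiguous slice of the shard key.
scan_keys = range(100, 200)
hashed_scan = shards_touched(hashed_shard, scan_keys)
ranged_scan = shards_touched(ranged_shard, scan_keys)

# Inserts with an ever-increasing key, such as a timestamp.
new_keys = range(1000, 1100)
hashed_writes = shards_touched(hashed_shard, new_keys)
ranged_writes = shards_touched(ranged_shard, new_keys)

print("scan   -> hashed:", sorted(hashed_scan), "ranged:", sorted(ranged_scan))
print("writes -> hashed:", sorted(hashed_writes), "ranged:", sorted(ranged_writes))
```

In this toy layout the range scan stays on a single shard under ranged routing but fans out under hashed routing, while the monotonically increasing inserts all land on the last shard under ranged routing.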

In my opinion, neither option would be very scalable for a data warehouse. However, I don't know of any other way to make a MongoDB data warehouse scalable.

Is MongoDB sharding really suitable for making a data warehouse scalable?

Markus W Mahlberg · Accepted Answer

Erm, if you read a lot of data, you will most likely exhaust the physical read capacity of a single server. You want the reads to be done in parallel, unless I have a very wrong understanding of data warehousing and of the limitations of today's HDDs and SSDs.

The first thing you would do is select the subset of the data you want to analyze, right? If you have a lot of data, it makes sense for this matching to be done in parallel. Once the subset is selected, further analysis is applied to it, right? This is exactly what MongoDB does in the aggregation framework: an early match is run on all of the affected shards, and the results are sent to the primary shard for that database, where the remaining stages of the aggregation pipeline are applied.
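The match-then-merge flow described above can be sketched in plain Python. This is a simulation under stated assumptions, not MongoDB code: the `SHARDS` data, the field names and the helpers `match_on_shard` and `merge_and_group` are all made up for illustration, and threads stand in for shards working in parallel.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical fact rows, already distributed over three shards.
SHARDS = [
    [{"region": "EU", "year": 2014, "amount": 10},
     {"region": "US", "year": 2013, "amount": 7}],
    [{"region": "EU", "year": 2014, "amount": 5},
     {"region": "US", "year": 2014, "amount": 3}],
    [{"region": "US", "year": 2014, "amount": 4},
     {"region": "EU", "year": 2013, "amount": 9}],
]

def match_on_shard(rows, year):
    """Shard-local step: the early match filters rows where they are stored."""
    return [r for r in rows if r["year"] == year]

def merge_and_group(partials):
    """Merge step: combine the per-shard results and sum amount per region."""
    totals = defaultdict(int)
    for rows in partials:
        for r in rows:
            totals[r["region"]] += r["amount"]
    return dict(totals)

# Every shard runs the match concurrently; only the matching rows travel on.
with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
    partials = list(pool.map(lambda rows: match_on_shard(rows, 2014), SHARDS))

totals = merge_and_group(partials)
print(totals)
```

In an actual aggregation pipeline the two helpers would roughly correspond to a `$match` stage followed by a `$group` stage. The point of the sketch is that the filtering cost is spread over all shards, and only the reduced subset reaches the node that does the final grouping.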
