Why does an increased number of cluster nodes speed up queries in Hadoop's MapReduce?

Question

I just started learning Hadoop. The official guide mentions that doubling the size of the cluster lets a query over twice as much data run as fast as the original query.

A traditional RDBMS, on the other hand, would still take twice as long to return the result.

I cannot grasp the relationship between the cluster and the data processing. I hope someone can give me some idea.

Lauri Peltonen · Accepted Answer

It's the basic idea of distributed computing.

If you have one server working on data of size X, it will spend time Y on it. If you have 2X data, the same server will (roughly) spend 2Y time on it.

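But if you have 10 servers working in parallel (in a distributed fashion), each holding the entire data (X) and each processing a different tenth of it, they will spend roughly Y/10 time on it. By the same logic, 2X data on twice as many servers takes about the same time as X data on the original servers, which is the effect the guide describes. You would gain the same effect by having 10 times more resources on the one server, but this is usually not feasible. (Increasing CPU power 10-fold, for example, is not very reasonable.)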
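To make the arithmetic concrete, here is a minimal sketch of that idealized model. It assumes perfectly even splitting and no coordination overhead, which real clusters never quite achieve; the function name `query_time` and the numbers are made up for illustration.

```python
def query_time(data_size, servers, rate_per_server=1.0):
    """Idealized time to scan data_size units split evenly across servers.

    Assumption: no network, scheduling, or merge overhead is modeled.
    """
    return data_size / (servers * rate_per_server)


X = 100.0  # hypothetical data size, in arbitrary units

baseline = query_time(X, servers=1)          # Y
doubled_data = query_time(2 * X, servers=1)  # 2Y: same server, twice the data
doubled_both = query_time(2 * X, servers=2)  # Y again: twice the data, twice the servers
ten_servers = query_time(X, servers=10)      # Y / 10
```

In this model, doubling both the data and the number of servers leaves the query time unchanged, while a single server has no way to avoid the 2Y cost.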

This is, of course, a very rough simplification. Hadoop doesn't store the entire dataset on all of the servers, just the needed parts. Each server holds a subset of the data, and the servers work on the data they have to produce one "answer" in the end. This requires communication and protocols to agree on what data to share, how to share it, how to distribute it, and so on. This is what Hadoop does.
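The "each server works on its own subset, then the results are combined" idea can be sketched in plain Python. This is not Hadoop's API: the helper names are hypothetical, and threads merely stand in for servers to show the structure of the computation (they do not give a real speedup here).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, n_servers):
    """Give each simulated server its own subset of the records."""
    return [records[i::n_servers] for i in range(n_servers)]


def map_count_words(partition):
    """Map step: a server counts words only in the lines it holds."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def reduce_counts(partial_counts):
    """Reduce step: merge the per-server results into one answer."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def word_count(records, n_servers):
    partitions = split_into_partitions(records, n_servers)
    # Threads simulate independent servers working at the same time.
    with ThreadPoolExecutor(max_workers=n_servers) as pool:
        partials = list(pool.map(map_count_words, partitions))
    return reduce_counts(partials)


lines = ["big data big cluster", "data node", "cluster node node"]
result = word_count(lines, n_servers=3)
```

The final answer is the same whether the lines are processed by one worker or three. Only the amount of data each worker has to read changes, and that is where the speedup comes from.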
