ILikeFood
ILikeFood

Reputation: 560

Should hadoop clusters run on identical hardware?

I remember reading somewhere that Hadoop's performance deteriorates significantly if the machines it runs on are very different from one another, but I can't seem to find that comment anymore. I am considering running a Hadoop cluster on an array of VMs that is not directly managed by my group, and I need to know if this is a requirement that I should put in my request.

So, should I insist on all of my machines having identical hardware, or is it okay to run on different machines in different hardware configurations?

Thanks.

Upvotes: 10

Views: 3814

Answers (2)

s3cur3
s3cur3

Reputation: 3025

A homogenous cluster is certainly ideal, but it's not strictly necessary. Yahoo!, Inc., for example, runs heterogeneous clusters in their production environments. From talking with researchers there, they find that there is a performance hit due to scheduling issues (a big enough hit that they're working hard to add performance-aware scheduling to their tools), but the penalty is not crippling.

Upvotes: 4

pyfunc
pyfunc

Reputation: 66719

Following papers describes how heterogeneous cluster affect the performance of hadoop map-reduce:

In a heterogeneous cluster, the computing capacities of nodes may vary significantly. A high-speed node can finish processing data stored in a local disk of the node faster than low-speed counterparts. After a fast node complete the processing of its local input data, the node must support load sharing by handling unprocessed data located in one or more remote slow nodes. When the amount of transferred data due to load sharing is very large, the overhead of moving unprocessed data from slow nodes to fast nodes becomes a critical issue affecting Hadoop’s performance.

Following references has more details:

  1. http://computerresearch.org/stpr/index.php/gjcst/article/view/749/658
  2. http://www.usenix.org/event/osdi08/tech/full_papers/zaharia/zaharia.pdf

It also provides ways in which you could improve the performance on heterogeneous cluster or avoid this performance penalty.

It is wisely suggested that you have homogenous machines on your cluster but if these machines do not have wildly different specifications and performance difference, you should carry on with building your cluster.

For production systems, you should suggest for homogenous machines. For development, performance is not critical.

How ever, you should be able to benchmark your Hadoop cluster after you have built it.

Upvotes: 14

Related Questions