Ravenwater

Reputation: 751

What is the typical relationship between compute and storage capacity for large scale Hadoop clusters?

I am looking at dimensioning a large cluster (10k cores) that needs to support both compute-bound deep analytics and I/O-bound big data workloads, and I would like to hear from people who have built a big data cluster how they sized compute against local disk storage. I am assuming a direct-attached storage architecture, as advocated by online MapReduce-based data warehouses.

Looking at some medium-density blade equipment as of 2012, such as dual Xeon 5650s, I can put roughly 2TB per server as direct-attached storage. That would give me about 100TFlops per 2TB of storage, or a 5:1 ratio. Lower-density equipment can be as low as 1:1; higher-density equipment can be as high as 10:1.

I would be interested to hear what ratios other big data folks are running.

Upvotes: 4

Views: 399

Answers (2)

Ravenwater

Reputation: 751

From the third article Praveen linked, written by Eric Baldeschwieler at Hortonworks and dated Sept 2011:

We get asked a lot of questions about how to select Apache Hadoop worker node hardware. During my time at Yahoo!, we bought a lot of nodes with 6*2TB SATA drives, 24GB RAM and 8 cores in a dual socket configuration. This has proven to be a pretty good configuration. This year, I’ve seen systems with 12*2TB SATA drives, 48GB RAM and 8 cores in a dual socket configuration. We will see a move to 3TB drives this year.

What configuration makes sense for any given organization is driven by such ratios as the storage-to-compute ratio of your workload and other factors that cannot be answered in a generic way. Further, the hardware industry moves quickly. In this post I’ll try to outline the principles that have generally guided Hadoop hardware configuration selections over the last six years. All of these thoughts are aimed at designing medium to large Apache Hadoop clusters. Scott Carey made a good case for smaller machines for small clusters the other day on the Apache mailing list.
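To put the two node configurations quoted above in terms of the question, here is a rough sketch that computes storage per core and RAM per core, then scales each configuration to a 10k-core cluster. The class and function names are mine, not from the article. The replication factor of 3 is the HDFS default and is an assumption here, since the article excerpt does not mention it; the figures are raw disk capacity and ignore space reserved for the OS and intermediate MapReduce output.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeConfig:
    """One worker node, described the way the quoted article does."""
    drives: int          # number of SATA drives per node
    tb_per_drive: float  # capacity of each drive in TB
    ram_gb: int          # RAM per node in GB
    cores: int           # cores per node (both sockets combined)

    @property
    def raw_tb(self) -> float:
        # Raw direct-attached capacity per node.
        return self.drives * self.tb_per_drive

    @property
    def tb_per_core(self) -> float:
        return self.raw_tb / self.cores

    @property
    def ram_gb_per_core(self) -> float:
        return self.ram_gb / self.cores


def nodes_for_cores(config: NodeConfig, target_cores: int) -> int:
    # Round up so the cluster has at least the requested core count.
    return math.ceil(target_cores / config.cores)


def cluster_raw_tb(config: NodeConfig, target_cores: int) -> float:
    return nodes_for_cores(config, target_cores) * config.raw_tb


def cluster_usable_tb(config: NodeConfig, target_cores: int,
                      replication: int = 3) -> float:
    # ASSUMPTION: every block is stored `replication` times (HDFS default is 3).
    return cluster_raw_tb(config, target_cores) / replication


# The two configurations named in the article excerpt.
YAHOO_NODE = NodeConfig(drives=6, tb_per_drive=2.0, ram_gb=24, cores=8)
DENSER_NODE = NodeConfig(drives=12, tb_per_drive=2.0, ram_gb=48, cores=8)

if __name__ == "__main__":
    for label, cfg in (("6x2TB", YAHOO_NODE), ("12x2TB", DENSER_NODE)):
        print(label,
              f"{cfg.tb_per_core} TB/core,",
              f"{cfg.ram_gb_per_core} GB RAM/core,",
              f"{nodes_for_cores(cfg, 10_000)} nodes,",
              f"{cluster_raw_tb(cfg, 10_000)} TB raw,",
              f"{cluster_usable_tb(cfg, 10_000)} TB usable at 3x replication")
```

For the 10k-core target in the question, the 6*2TB configuration works out to 1.5 TB of raw disk per core across 1,250 nodes (15,000 TB raw), and the 12*2TB configuration to 3 TB per core (30,000 TB raw). Whether either ratio suits a mixed compute-bound and I/O-bound workload still depends on the workload itself, as the article says.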

Upvotes: 1

Praveen Sripati

Reputation: 33545

Here are some articles (1, 2, 3) to start with for Hadoop hardware sizing.

Upvotes: 2
