Naty Bizz

Reputation: 2342

Nodes required in Hadoop

I'm quite new to Hadoop. My question is simple: is there any research or guideline for determining how many nodes Hadoop will need based on how many transactions (database transactions) and how many items (items in every transaction) I have?

Upvotes: 0

Views: 5515

Answers (1)

Donald Miner

Reputation: 39893

Disclaimer: This is a really hard question and could probably have a book written on the subject. Also, I have enough subjective opinion in here to make me nervous about documenting it on StackOverflow, so I hope this helps, but don't think that this is some sort of bible that you have to listen to.

Also, your question is a little off base for a Hadoop question. Hadoop rarely talks in terms of transactions and items. You put files in HDFS, not records (although those files can have records). And your number of items (records?) doesn't matter-- data size matters. Transactions in the traditional sense in Hadoop don't exist. I'll answer your question anyways, but you are throwing me some warning signs. Make sure Hadoop is right for what you are trying to do. People typically ask: how much data (in TB) do I need to put in HDFS? How many TB/day do I need to load into HDFS? How many GB does my MapReduce job need to process?
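
To make that concrete: if you do want to translate transactions and items into the kind of number Hadoop people actually care about, a back-of-envelope data-size estimate is all you need. Here is a minimal sketch in Python, where every input (transaction counts, record sizes) is a made-up placeholder rather than anything from your question:

    # Back-of-envelope: turn "transactions and items" into a data-size estimate.
    # All inputs below are hypothetical placeholders, not numbers from the question.
    transactions_per_day = 50_000_000      # assumed transactions per day
    items_per_transaction = 8              # assumed average items per transaction
    bytes_per_item_record = 200            # assumed average serialized record size

    bytes_per_day = transactions_per_day * items_per_transaction * bytes_per_item_record
    tb_per_day = bytes_per_day / 1e12
    print(f"Raw data volume: {tb_per_day:.2f} TB/day")   # ~0.08 TB/day for these inputs

That TB/day figure (plus how long you keep the data) is what actually drives node count, not the transaction count by itself.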

Here is some advice about Hadoop that has served me well: Hadoop scales out nicely. The code stays the same whether you have 5 nodes or 500 nodes. Performance and storage scale pretty linearly. Try it out on 3-4 nodes, see what happens, then extrapolate to what you really need.
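
As a sketch of that "try it small, then extrapolate" idea, assuming roughly linear scaling and entirely made-up pilot numbers:

    # Extrapolate from a small pilot cluster, assuming near-linear scaling.
    # The pilot measurements here are hypothetical; plug in your own.
    pilot_nodes = 4
    pilot_job_minutes = 90        # measured runtime of a representative job
    target_job_minutes = 15       # the runtime you actually need

    estimated_nodes = pilot_nodes * pilot_job_minutes / target_job_minutes
    print(f"Estimated nodes needed: {estimated_nodes:.0f}")   # 24 for these inputs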


Here are some guides that I sometimes point people to.

http://hortonworks.com/blog/how-to-size-your-hadoop-cluster/ -- this one from Hortonworks is a little too high-level for my tastes, but it might help you out.

http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/ -- a decent article that explains some of the points to consider when sizing your cluster.


My rules of thumb (i.e., some of these are based on my opinion) on data storage:

  • I like my cluster's storage to be no more than 50% utilized, and on top of that you have 3x replication. With MapReduce you'll also need a decent amount of "scratch space" for temporary job output and for the data transformations you are doing. This means you need 6x (2x headroom on top of 3x replication) your base data storage: 10TB of data means you need 60TB of HDFS. Don't forget to compress your data. (There's a rough calculation sketch after this list.)
  • Under 10 or so nodes, you can get away with running all of your master processes on one node. Eventually you'll want separate nodes for the master processes.
  • Next is job throughput. This one is really hard to pin down because it's hard to tell how long a task will take on hardware you don't have yet. Take your theoretical disk throughput, multiply by the total number of disks in the cluster, then divide by two (to account for HDFS overhead). Then do the math on how long it takes to read your data set off disk and decide whether you are happy with that. If you aren't, you need more nodes.
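
Here is a rough sketch of the two calculations above (the 6x storage rule and the disk-throughput estimate). Every input is a placeholder for your own hardware and data sizes:

    # Storage sizing: 3x replication plus ~2x headroom for scratch space => 6x raw data.
    raw_data_tb = 10
    hdfs_capacity_tb = raw_data_tb * 3 * 2
    print(f"HDFS capacity needed: {hdfs_capacity_tb} TB")     # 60 TB for 10 TB of data

    # Throughput sizing: theoretical disk bandwidth x total disks, halved for HDFS overhead.
    nodes = 10
    disks_per_node = 12
    mb_per_sec_per_disk = 100                 # assumed sequential throughput per disk
    effective_mb_per_sec = nodes * disks_per_node * mb_per_sec_per_disk / 2

    dataset_mb = raw_data_tb * 1e6            # 1 TB ~= 1e6 MB
    minutes_to_scan = dataset_mb / effective_mb_per_sec / 60
    print(f"Time to read the full data set: {minutes_to_scan:.1f} minutes")
    # If that number is too slow for you, add nodes (or disks) and recompute.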

Upvotes: 4
