Reputation: 2342
I'm quite new to Hadoop, and my question is simple: is there any research or guideline for determining how many nodes Hadoop will need based on how many transactions (database transactions) and how many items (items in each transaction) I have?
Upvotes: 0
Views: 5515
Reputation: 39893
Disclaimer: This is a really hard question and could probably have a book written on the subject. Also, I have enough subjective opinion in here to make me nervous about documenting it on StackOverflow, so I hope this helps, but don't think that this is some sort of bible that you have to listen to.
Also, your question is a little off base for Hadoop. Hadoop rarely talks in terms of transactions and items. You put files in HDFS, not records (although those files can have records). And your number of items (records?) doesn't matter; data size matters. Transactions in the traditional sense don't exist in Hadoop. I'll answer your question anyway, but you are throwing me some warning signs. Make sure Hadoop is right for what you are trying to do. People typically ask: how much data (in TB) do I need to put in HDFS? How many TB/day do I need to load into HDFS? How many GB does my MapReduce job need to process?
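To make that concrete: if you do know your record counts and rough record sizes, you can convert them into the data-size numbers that Hadoop planning actually uses. Here is a back-of-the-envelope sketch in Python; every figure in it is a made-up placeholder, so plug in your own:

```python
# Rough conversion from "transactions and items" into raw data size,
# which is the number Hadoop sizing actually cares about.
# All figures are illustrative placeholders -- substitute your own.

transactions_per_day = 10_000_000   # hypothetical daily transaction count
items_per_transaction = 20          # hypothetical average items per transaction
bytes_per_item_record = 200         # hypothetical average on-disk size of one item record

raw_bytes_per_day = transactions_per_day * items_per_transaction * bytes_per_item_record
raw_gb_per_day = raw_bytes_per_day / 1024**3
raw_tb_per_year = raw_bytes_per_day * 365 / 1024**4

print(f"~{raw_gb_per_day:,.1f} GB/day of raw data")
print(f"~{raw_tb_per_year:,.1f} TB/year of raw data")
```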
Here is some advice about Hadoop that has served me well: Hadoop scales out nicely. The code stays the same for 5 nodes or 500 nodes. Performance and storage scale pretty linearly. Try it out on 3-4 nodes and see what happens, then extrapolate from that to what you really need.
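To show what I mean by extrapolating from a small cluster, here is the kind of arithmetic I'd do after a test run. Again, every number below is a hypothetical measurement; use what you actually observe:

```python
# Extrapolating from a small pilot cluster, assuming roughly linear scale-out.
# Every number here is a hypothetical measurement -- substitute your own observations.

pilot_nodes = 4
pilot_job_input_gb = 500    # data the test job processed on the pilot
pilot_job_minutes = 90      # how long that job took

target_input_gb = 20_000    # data size you actually need to process
target_minutes = 120        # how fast you need it done

# Per-node throughput on the pilot (GB per node-minute)
gb_per_node_minute = pilot_job_input_gb / (pilot_nodes * pilot_job_minutes)

# Nodes needed to hit the target, assuming the job keeps scaling linearly
needed_nodes = target_input_gb / (target_minutes * gb_per_node_minute)
print(f"Roughly {needed_nodes:.0f} nodes (before adding headroom for failures and growth)")
```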
Here are some guides that I sometimes point people to.
http://hortonworks.com/blog/how-to-size-your-hadoop-cluster/ -- this one from Hortonworks is a little too high-level for my tastes, but it might help you out.
http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/ -- a decent article that explains some of the points to consider when sizing your cluster.
My rules of thumb (i.e., some of these are based on my opinion) on data storage:
Upvotes: 4