uday sagar

Reputation: 87

hadoop - definitive guide - why is a block in hdfs so large

I came across the following paragraph in the Definitive Guide (HDFS Concepts - Blocks) and could not understand it.

Map tasks in MapReduce normally operate on one block at a time, so if you have too few tasks (fewer than nodes in the cluster), your jobs will run slower than they could otherwise.

I am wondering how jobs would run slower when there are fewer tasks than nodes in the cluster. Say there are 1000 nodes in the cluster and 3 tasks (by tasks I mean blocks, since each block is sent to a node as a single task); won't the time it takes to get the result always be less than in a scenario with, say, 1000 nodes and 1000 tasks?

I'm not convinced by this paragraph in the Definitive Guide.

Upvotes: 0

Views: 59

Answers (1)

celik

Reputation: 174

The paragraph you quoted from the book basically says "utilize as many nodes as you can." If you have 1000 nodes and only 3 blocks or tasks, only 3 nodes are working on your tasks and the other 997 nodes do nothing for your job. If you have 1000 nodes and 1000 tasks, and each of these 1000 nodes holds some part of your data, all 1000 nodes will be utilized for your tasks. The total amount of data is the same either way, so with only 3 tasks each working node has to process a far larger share of it. You also take advantage of data locality, since each node works on its local data first.
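
For what it's worth, here is a rough back-of-envelope sketch in Python (the input size, per-node throughput, and node count are made-up numbers, purely for illustration) of why the same input finishes much sooner once there are enough blocks, and therefore map tasks, to keep every node busy:

    # Hypothetical numbers, only to illustrate the point above:
    # the same total input finishes faster when there are enough
    # blocks (tasks) to keep every node busy.

    TOTAL_INPUT_GB = 128             # assumed total input size
    NODE_THROUGHPUT_GB_PER_MIN = 1   # assumed per-node processing rate
    NODES = 1000

    def job_time_minutes(num_tasks: int) -> float:
        """Approximate wall-clock time when each task is one block.

        Only min(num_tasks, NODES) nodes can work in parallel, and each
        task covers an equal share of the total input.
        """
        data_per_task_gb = TOTAL_INPUT_GB / num_tasks
        # Tasks run in "waves" of at most NODES at a time.
        waves = -(-num_tasks // NODES)  # ceiling division
        return waves * data_per_task_gb / NODE_THROUGHPUT_GB_PER_MIN

    print(job_time_minutes(3))     # 3 huge tasks  -> ~42.7 min, 997 nodes idle
    print(job_time_minutes(1000))  # 1000 tasks    -> ~0.13 min, all nodes busy

With only 3 tasks, each one has to chew through roughly a third of the whole input on a single node while 997 nodes sit idle; with 1000 tasks, every node processes only its own local block.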

Upvotes: 1
