Reputation: 382
I read that map tasks usually run on the node where their input data resides, for locality optimization. But in my JobTracker (Hadoop administration page), I can see that the input split locations for a map task running on, say, node1 are node3, node1 and node4. I have a total of 10 map tasks spawned, and for some of them the input split location points to 3 different nodes other than the map task's own node. Is this common and OK? Is it related to how I load my input files into HDFS and whether they are evenly distributed across the cluster? If this is not OK, how do I make sure the map tasks pick the data from the same node as far as possible?
Upvotes: 0
Views: 692
Reputation: 34184
Are some of your tasks taking longer than the others? If that is the case, speculative execution will come into the picture, which is probably the reason behind this: a speculative duplicate of a task is scheduled on whichever node has a free slot, which is often not a node holding the task's input split, so its input shows up as remote in the JobTracker page.
Tasks may be slow for various reasons, including hardware degradation or software mis-configuration, but the causes may be hard to detect since the tasks still complete successfully, albeit after a longer time than expected. Hadoop doesn’t try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is running slower than expected and launches another, equivalent, task as a backup. This is termed speculative execution
of tasks.
Speculative execution is turned on by default. However, it can be enabled or disabled independently for map tasks and reduce tasks, on a cluster-wide basis, or on a per-job basis.
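As a reference, here is a minimal sketch of switching speculation off per job from the client side (the class name is made up for illustration). The property names shown are the classic Hadoop 1.x / JobTracker-era ones; on MRv2 the equivalents are `mapreduce.map.speculative` and `mapreduce.reduce.speculative`, so check which applies to your version. The same properties can also be set cluster-wide in mapred-site.xml.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculationJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Disable speculative execution for map tasks only...
        conf.setBoolean("mapred.map.tasks.speculative.execution", false);
        // ...and, independently, for reduce tasks.
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

        Job job = new Job(conf, "no-speculation-example");
        // ... set mapper/reducer classes and input/output paths as usual,
        // then submit with job.waitForCompletion(true).
    }
}
```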
Hope this answers the question.
P.S. : Speculative execution is an optimization, not a feature to make jobs run more reliably. If there are bugs that sometimes cause a task to hang or slow down, then relying on speculative execution to avoid these problems is unwise, and won’t work reliably, since the same bugs are likely to affect the speculative task. You should fix the bug so that the task doesn’t hang or slow down.
Upvotes: 1