Reputation: 29123
I ran a hadoop job and when I look in some map tasks I see they are not running where the file's blocks are. E.g., the map task runs on slave1, but the file blocks (all of them) are in slave2. The files are all gzip.
Why is that happening and how to resolve?
UPDATE: note there are many pending tasks, so this is not a case of a node being idle and therefore hosting tasks that read from other nodes.
Upvotes: 5
Views: 726
Reputation: 4575
Hadoop's default (FIFO) scheduler works like this: When a node has spare capacity, it contacts the master and asks for more work. The master tries to assign a data-local task, or a rack-local task, but if it can't, it will assign any task in the queue (of waiting tasks) to that node. However, while this node was being assigned this non-local task (we'll call it task X), it is possible that another node also had spare capacity and contacted the master asking for work. Even if this node actually had a local copy of the data required by X, it will not be assigned that task because the other node was able to acquire the lock to the master slightly faster than the latter node. This results in poor data locality, but FAST task assignment.
In contrast, the Fair Scheduler uses a technique called delayed scheduling that achieves higher locality by delaying non-local task assignment for a "little bit" (configurable). It achieves higher locality but at a small cost of delaying some tasks.
Other people are working on better schedulers, and this may likely be improved in the future. For now, you can choose to use the Fair Scheduler if you wish to achieve higher data locality.
I disagree with @donald-miner's conclusion that "With a default replication factor of 3, you don't see very many tasks that are not data local." He is correct in noting that more replicas will give improve your locality %, but the percentage of data-local tasks may still be very low. I've also ran experiments myself and saw very low data locality with the FIFO scheduler. You could achieve high locality if your job is large (has many tasks), but for the more common, smaller jobs, they suffer from a problem called "head-of-line scheduling". Quoting from this paper:
The first locality problem occurs in small jobs (jobs that have small input files and hence have a small number of data blocks to read). The problem is that whenever a job reaches the head of the sorted list [...] (i.e. has the fewest running tasks), one of its tasks is launched on the next slot that becomes free, no matter which node this slot is on. If the head-of-line job is small, it is unlikely to have data on the node that is given to it. For example, a job with data on 10% of nodes will only achieve 10% locality.
That paper goes on to cite numbers from a production cluster at Facebook, and they reported observing just 5% of data locality in a large, production environment.
Final note: Should you care if you have low data locality? Not too much. The running time of your jobs may be dominated by the stragglers (tasks that take longer to complete) and shuffle phase, so improving data locality would only have a very modest improve in running time (if any at all).
Upvotes: 6
Reputation: 39893
Unfortunately, the default scheduler isn't that smart. I'm not sure exactly what's going on, but I think it's using some sort of greedy-style scheduling where it tries to schedule what it can now for the next task, and then moves on. There could definitely be improvements made to the hadoop scheduler and there have been a few academic attempts and making hadoop scheduling more optimal.
This research paper shows that the default hadoop scheduler is not optimal. In the results, they show that increasing the replication factor to three improves data locality significantly, with diminishing returns after that.
So, why hasn't the default scheduler been improved? Here is my opinion/theory: With a default replication factor of 3, you don't see very many tasks that are not data local. By having more replicas, you give the schedule more flexibility to fit tasks in the right spots. Basically, it's a coincidence that you have 3 replicas, and the default scheduler takes advantage of that by being implemented in a lazy manner. Since you typically have 3 replicas for redundancy sake already... there isn't much motivation to help scheduler performance for people with a replication of 1.
If you have the space, I suggest just upping the replication factor to two or three. There really isn't much downside.
Upvotes: 1