Hadoop: how are reducer nodes selected?

Question

I just started learning Hadoop, but I don't understand how a datanode becomes a reducer node.

Once the map task completes, the contents of its sort buffer are flushed to the local disk after the KV pairs are sorted and partitioned.
Then the JobTracker is notified about the spilled partitions.
After that, the reducers start requesting the data for a particular partition.

But how does the JobTracker decide which node becomes a reducer node? I'm reading Hadoop: The Definitive Guide, but this step is not mentioned in the book.

Thanks, Bruckwald

Thomas Jungblut · Accepted Answer

Pretty much first-come, first-served. Tasks are assigned via heartbeats: when a TaskTracker pings the JobTracker to report that it is alive, the response may contain a new task to run:

List<Task> tasks = getSetupAndCleanupTasks(taskTrackerStatus);
if (tasks == null) {
   tasks = taskScheduler.assignTasks(taskTrackerStatus);
}
if (tasks != null) {
   for (Task task : tasks) {
     expireLaunchingTasks.addNewTask(task.getTaskID());
     LOG.debug(trackerName + " -> LaunchTask: " + task.getTaskID());
     actions.add(new LaunchTaskAction(task));
   }
}

The code above comes from the JobTracker source. So besides which TaskTracker comes first, the task scheduler also checks resource conditions (e.g. whether the tracker has a free slot and the node is not overloaded).

The relevant scheduler code (which isn't particularly exciting) looks like this:

//
// Same thing, but for reduce tasks
// However we _never_ assign more than 1 reduce task per heartbeat
//
final int trackerCurrentReduceCapacity = 
  Math.min((int)Math.ceil(reduceLoadFactor * trackerReduceCapacity), 
           trackerReduceCapacity);
final int availableReduceSlots = 
  Math.min((trackerCurrentReduceCapacity - trackerRunningReduces), 1);
boolean exceededReducePadding = false;
if (availableReduceSlots > 0) {
  exceededReducePadding = exceededPadding(false, clusterStatus, 
                                          trackerReduceCapacity);
  synchronized (jobQueue) {
    for (JobInProgress job : jobQueue) {
      if (job.getStatus().getRunState() != JobStatus.RUNNING ||
          job.numReduceTasks == 0) {
        continue;
      }

      Task t = job.obtainNewReduceTask(taskTracker, numTaskTrackers,
                                       taskTrackerManager.getNumberOfUniqueHosts());
      if (t != null) {
        assignedTasks.add(t);
        break;
      }

      // Don't assign reduce tasks to the hilt!
      // Leave some free slots in the cluster for future task-failures,
      // speculative tasks etc. beyond the highest priority job
      if (exceededReducePadding) {
        break;
      }
    }
  }
}

Basically, the first TaskTracker that heartbeats to the JobTracker and has a free reduce slot will get a reduce task.
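
To make the "first heartbeat with a free slot wins" rule concrete, here is a minimal, self-contained sketch. It is not the actual Hadoop scheduler: the `Tracker` class, the `assign` method and the node names are hypothetical and exist only for illustration, and it ignores the job queue, padding and priorities. Only the slot arithmetic in `availableReduceSlots` mirrors the snippet quoted above, with the load factor passed in as a plain parameter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReduceAssignmentSketch {

    /** Hypothetical stand-in for the status a TaskTracker reports in a heartbeat. */
    static final class Tracker {
        final String name;
        final int reduceCapacity; // total reduce slots on this node
        int runningReduces;       // reduce tasks currently running there

        Tracker(String name, int reduceCapacity, int runningReduces) {
            this.name = name;
            this.reduceCapacity = reduceCapacity;
            this.runningReduces = runningReduces;
        }
    }

    /**
     * Same arithmetic as the quoted scheduler code: scale the capacity by the
     * load factor, subtract what is already running, and never hand out more
     * than one reduce task per heartbeat.
     */
    static int availableReduceSlots(double reduceLoadFactor, int capacity, int running) {
        int currentCapacity =
            Math.min((int) Math.ceil(reduceLoadFactor * capacity), capacity);
        return Math.min(currentCapacity - running, 1);
    }

    /**
     * Processes heartbeats in arrival order and returns the names of the
     * trackers that received a reduce task (illustrative only).
     */
    static List<String> assign(List<Tracker> heartbeats, int pendingReduces,
                               double reduceLoadFactor) {
        List<String> assigned = new ArrayList<>();
        for (Tracker t : heartbeats) {
            if (pendingReduces == 0) {
                break; // nothing left to schedule
            }
            if (availableReduceSlots(reduceLoadFactor, t.reduceCapacity, t.runningReduces) > 0) {
                t.runningReduces++;
                pendingReduces--;
                assigned.add(t.name);
            }
        }
        return assigned;
    }

    public static void main(String[] args) {
        Tracker a = new Tracker("node-a", 2, 2); // both slots busy
        Tracker b = new Tracker("node-b", 2, 0); // idle
        Tracker c = new Tracker("node-c", 2, 1); // one slot free
        // node-b heartbeats twice, so it can pick up a second reduce task later.
        List<String> result = assign(Arrays.asList(a, b, c, b), 3, 1.0);
        System.out.println(result); // prints [node-b, node-c, node-b]
    }
}
```

Note that node-a heartbeats first but gets nothing because it has no free slot, and that no tracker receives more than one reduce task per heartbeat, even when it has several free slots.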
