brain storm

Reputation: 31252

Clarification on map tasks and reduce tasks in Hadoop?

I am reading Hadoop: The Definitive Guide. While working through some of the concepts, I read a few SO posts that were clarifying in some ways yet confusing in others. Here are some points I would like expert opinions on: are they correct, and if not, what exactly happens?


Let's assume this is how my HDFS looks in a pseudo-distributed cluster with one node:

/local/path/to/datanode/storage/0/blk_00001  300 MB
/local/path/to/datanode/storage/0/blk_00002  300 MB
/local/path/to/datanode/storage/0/blk_00003  300 MB
/local/path/to/datanode/storage/0/blk_00004  200 MB

The total size of my file is 1100 MB, and it was split into chunks of 300 MB (which is my block size).

Now I am about to start my MapReduce job:

I understand that the InputFormat (which in turn splits the file) determines the number of map tasks.

CASE 1:

I have the following setting for the minimum split size: mapred.min.split.size = 400 MB.

There will be three map tasks in total, each with roughly 400 MB of input to process.

1) Mapper 1: The first map task would use 300 MB from blk_00001 and 100 MB from blk_00002 (data locality is lost).

2) Mapper 2: The second mapper has to seek to the 100 MB offset in blk_00002, read its remaining 200 MB, plus another 200 MB from blk_00003.

3) Mapper 3: There are 100 MB left in blk_00003 to be processed, along with the 200 MB in blk_00004. That totals 300 MB, which should be processed as a whole.

So the block size plays NO ROLE in the map tasks.

Q1: Is everything correct till here?
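For reference, here is a minimal sketch of how the CASE 1 setting might be applied to a job. It assumes the old mapred API (Hadoop 1.x) and a hypothetical input path; mapper, reducer, and output setup are omitted.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MinSplitSizeJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MinSplitSizeJob.class);
    // 400 MB minimum split size, expressed in bytes
    conf.setLong("mapred.min.split.size", 400L * 1024 * 1024);
    // Hypothetical input path
    FileInputFormat.addInputPath(conf, new Path("/user/example/input"));
    JobClient.runJob(conf);
  }
}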


CASE 2:

Now let's say I have the following setting for my MR job: mapred.tasktracker.map.tasks.maximum=3. This would mean that on any given node, at most three map tasks run in parallel.

Q2: If all the mappers above run in parallel on the same node, do they run as different threads with similar priority, or as separate processes at the CPU level?


CASE 3: My num in conf.setNumMapTasks(int num) is greater than the number of splits. Say num = 10 and the number of splits = 3. The total number of map tasks executed will be 3.

Q3: Correct?


Reducer tasks:

Q4: A mapper has to finish before the reducer starts - in all cases, to my knowledge. Are there any examples where it does not? I ask because the keys need to be sorted and handed to the reducers in order.

Q5: So what is the effect of mapred.reduce.slowstart.completed.maps=0.5? This would mean starting the reducers when the map tasks are 50% complete. But the reducers need the map jobs to be complete. Correct?

Q6: What is the default number of reducers if I don't specify anything?

It has been proposed to use 0.95-1.75 * (nodes * mapred.tasktracker.reduce.tasks.maximum). So if I have a cluster with 5 nodes, each allowing 5 reduce tasks, the formula gives 0.95 * 5 * 5 ≈ 24 reducers.

So should I set conf.setNumReduceTasks(24)?

Upvotes: 2

Views: 919

Answers (1)

Mike Park

Reputation: 10931

Q1: Is everything correct till here?

It depends on the input format. FileInputFormat will not make a split that is smaller than the block size, regardless of what you set the minimum split size to. Here is the code that computes the split size:

protected long computeSplitSize(long goalSize, long minSize, long blockSize) {
  // The goal size is capped at the block size, and the result is then
  // raised to at least the configured minimum split size.
  return Math.max(minSize, Math.min(goalSize, blockSize));
}
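Plugging in the question's numbers (minSize = 400 MB, blockSize = 300 MB, and a hypothetical goalSize of 1100 MB spread over 4 requested maps), the minimum split size dominates. A small self-contained sketch:

public class SplitSizeDemo {
  // Same formula as FileInputFormat above
  static long computeSplitSize(long goalSize, long minSize, long blockSize) {
    return Math.max(minSize, Math.min(goalSize, blockSize));
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024;
    long goalSize = 1100 * mb / 4; // hypothetical: 275 MB per requested map
    long split = computeSplitSize(goalSize, 400 * mb, 300 * mb);
    System.out.println(split / mb + " MB"); // prints "400 MB": minSize wins
  }
}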

Q2: If all the mappers above run in parallel on the same node, do they run as different threads with similar priority, or as separate processes at the CPU level?

Each task runs in its own Java virtual machine, so they are separate processes.

Q3: My num in conf.setNumMapTasks(int num) is greater than the number of splits. Say num = 10 and the number of splits = 3. The total number of map tasks executed will be 3. Correct?

setNumMapTasks() is no longer supported, and it was only ever meant as a hint to the MapReduce framework.
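For illustration only, this is what that hint looked like under the old mapred API; with 3 input splits, 3 map tasks run regardless of the value passed:

JobConf conf = new JobConf();
// A hint only: with 3 input splits, the framework still runs 3 map tasks.
conf.setNumMapTasks(10);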

Q4: A mapper has to finish before the reducer starts - in all cases, to my knowledge. Are there any examples where it does not? I ask because the keys need to be sorted and handed to the reducers in order.

Q5: So what is the effect of mapred.reduce.slowstart.completed.maps=0.5? This would mean starting the reducers when the map tasks are 50% complete. But the reducers need the map jobs to be complete. Correct?

Slow start only begins copying map output to the appropriate reducer machines. The reduce() method in the reducer does not get invoked until all mappers are complete.
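A minimal sketch of where that threshold would be set, assuming the old mapred API:

JobConf conf = new JobConf();
// Reducers may begin fetching map output once 50% of map tasks finish;
// reduce() itself still waits for every mapper to complete.
conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.5f);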

Q6: What is the default number of reducers if I don't specify anything?

The default is 1.

So should I set conf.setNumReduceTasks(24)?

Whatever works best for your task.
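As a starting point using the question's own numbers (5 nodes with 5 reduce slots each, both hypothetical), the 0.95 rule could be applied like this and then tuned:

int nodes = 5;
int reduceSlotsPerNode = 5; // mapred.tasktracker.reduce.tasks.maximum
// 0.95 * 5 * 5 = 23.75, which rounds to 24
conf.setNumReduceTasks((int) Math.round(0.95 * nodes * reduceSlotsPerNode));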

Upvotes: 3
