Reputation: 35
I am using Hadoop 2.0. When I alter the number of map tasks using job.setNumMapTasks, the number is as expected (both in the number of sequence files in the output folder and in the number of containers), but the tasks do not run in parallel; only 2 run at a time. For instance, when I set the number of map tasks to 5, it is as if 2 of them execute first, followed by 2 more, followed by 1. I have an 8-core system and would like to utilize it fully. A bit of online hunting (including on StackOverflow) suggested a few things, and I tried the following:
In spite of all this, I do not see any change in performance; it is still 2 map tasks at a time. I did not format my HDFS and reload the sequence file after step 3, though I'm not sure if that is the reason.
You can access my configuration files at https://www.dropbox.com/sh/jnxsm5m2ic1evn4/zPVcdk8GTp. Am I missing something here?
Also, I had another question. Some posts seem to mention that job.setNumMapTasks is just a hint to the framework and that the actual number is decided by the framework itself. However, I always find the number of tasks to be exactly what I specify. Is that expected?
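For reference, here is a minimal sketch of the kind of job setup I mean (the class name, paths, and key/value types are placeholders; I am using the old mapred API, since setNumMapTasks lives on JobConf):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;

    public class NumMapTasksSketch {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(NumMapTasksSketch.class);
            conf.setJobName("num-map-tasks-sketch");
            conf.setInputFormat(SequenceFileInputFormat.class);
            // Assuming Text/Text sequence files; match these to your data.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));  // placeholder input
            FileOutputFormat.setOutputPath(conf, new Path(args[1])); // placeholder output
            // Per the JobConf javadoc, this is only a hint: it is passed to
            // InputFormat.getSplits(conf, numSplits) as the desired split
            // count, and the InputFormat may honor or ignore it.
            // FileInputFormat uses it as a goal when sizing splits, which may
            // be why the final count often matches exactly.
            conf.setNumMapTasks(5);
            JobClient.runJob(conf);
        }
    }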
Thanks and Regards, Samudra
Upvotes: 1
Views: 1521
Reputation: 8522
In the classic MapReduce framework (MR1) you can set the number of map slots with the property mapred.tasktracker.map.tasks.maximum. But in YARN, things are different. See the discussion below on map/reduce slots in YARN:
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/J564g9A8tPE
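In YARN there are no fixed map slots; how many map containers run concurrently is bounded by the memory the NodeManager advertises versus what each container requests. As a rough sketch of the arithmetic (the property names are the standard Hadoop 2 ones; the fallback values are just the stock defaults, for illustration):

    import org.apache.hadoop.conf.Configuration;

    // Back-of-envelope check of how many map containers can run at once
    // on a single NodeManager.
    public class ContainerMath {
        public static void main(String[] args) {
            Configuration conf = new Configuration(); // picks up *-site.xml if on the classpath
            long nmMemoryMb  = conf.getLong("yarn.nodemanager.resource.memory-mb", 8192);
            long mapMemoryMb = conf.getLong("mapreduce.map.memory.mb", 1024);
            long amMemoryMb  = conf.getLong("yarn.app.mapreduce.am.resource.mb", 1536);
            // One container is taken by the MRAppMaster; the remainder is
            // available for map tasks (the scheduler may also round requests
            // up to yarn.scheduler.minimum-allocation-mb).
            long concurrentMaps = (nmMemoryMb - amMemoryMb) / mapMemoryMb;
            System.out.println("Rough max concurrent map containers: " + concurrentMaps);
        }
    }

If that quotient works out to about 2 on your machine, it would explain seeing only two maps at a time; raising yarn.nodemanager.resource.memory-mb or lowering mapreduce.map.memory.mb lets more containers run side by side.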
Upvotes: 1