Reputation: 35
I am using Hadoop 2.0. When I alter the number of map tasks using job.setNumMapTasks, the number is as expected (both in the number of sequence files in the output folder and in the number of containers), but the tasks do not run in parallel; only 2 run at a time. For instance, when I set the number of map tasks to 5, it is as if 2 of them execute first, followed by 2 more, followed by 1. I have an 8-core system and would like to utilize it fully. A bit of online hunting (including on StackOverflow) suggested a few things, and I tried the following:
In spite of all this, I do not see any change in performance; it is still 2 map tasks at a time. I did not format my HDFS and reload the sequence file after step 3, though I'm not sure if that is the reason.
You can access my configuration files at https://www.dropbox.com/sh/jnxsm5m2ic1evn4/zPVcdk8GTp. Am I missing something here?
Also, I had another question. Some posts seem to mention that job.setNumMapTasks is just a hint to the framework and that the actual number is decided by the framework itself. However, I always find the number of tasks to be exactly what I specify. Is that expected?
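For reference, here is a minimal sketch of the kind of job setup I mean (the class name, paths, and key/value types are placeholders; I am using the old mapred API, since setNumMapTasks lives on JobConf):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;

    public class NumMapTasksSketch {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(NumMapTasksSketch.class);
            conf.setJobName("num-map-tasks-sketch");
            conf.setInputFormat(SequenceFileInputFormat.class);
            // Assuming Text/Text sequence files; match these to your data.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(conf, new Path(args[0]));  // placeholder input
            FileOutputFormat.setOutputPath(conf, new Path(args[1])); // placeholder output
            // Per the JobConf javadoc, this is only a hint: it is passed to
            // InputFormat.getSplits(conf, numSplits) as the desired split
            // count, and the InputFormat may honor or ignore it.
            // FileInputFormat uses it as a goal when sizing splits, which may
            // be why the final count often matches exactly.
            conf.setNumMapTasks(5);
            JobClient.runJob(conf);
        }
    }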
Thanks and Regards, Samudra
Upvotes: 1
Views: 1521
Reputation: 8522
In the classic MapReduce framework (MR1) you can set the number of map slots with the property mapred.tasktracker.map.tasks.maximum. But in YARN, things are different. See the discussion below on map/reduce slots in YARN:
https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/J564g9A8tPE
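In YARN there are no fixed map slots; how many map containers run concurrently is bounded by the memory the NodeManager advertises versus what each container requests. As a rough sketch of the arithmetic (the property names are the standard Hadoop 2 ones; the fallback values are just the stock defaults, for illustration):

    import org.apache.hadoop.conf.Configuration;

    // Back-of-envelope check of how many map containers can run at once
    // on a single NodeManager.
    public class ContainerMath {
        public static void main(String[] args) {
            Configuration conf = new Configuration(); // picks up *-site.xml if on the classpath
            long nmMemoryMb  = conf.getLong("yarn.nodemanager.resource.memory-mb", 8192);
            long mapMemoryMb = conf.getLong("mapreduce.map.memory.mb", 1024);
            long amMemoryMb  = conf.getLong("yarn.app.mapreduce.am.resource.mb", 1536);
            // One container is taken by the MRAppMaster; the remainder is
            // available for map tasks (the scheduler may also round requests
            // up to yarn.scheduler.minimum-allocation-mb).
            long concurrentMaps = (nmMemoryMb - amMemoryMb) / mapMemoryMb;
            System.out.println("Rough max concurrent map containers: " + concurrentMaps);
        }
    }

If that quotient works out to about 2 on your machine, it would explain seeing only two maps at a time; raising yarn.nodemanager.resource.memory-mb or lowering mapreduce.map.memory.mb lets more containers run side by side.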
Upvotes: 1