Faiza Atheeq

Reputation: 11

Hadoop MapReduce - how to speed up job launch/setup

I'm using the mongo-hadoop adapter to run map/reduce jobs. Everything works fine except for the job launch time and the time taken by the job itself. Even when the dataset is very small, the map time is 13 seconds and the reduce time is 12 seconds. I have changed settings in mapred-site.xml and core-site.xml, but the time taken for map/reduce seems to stay constant. Is there any way I can reduce it? I also explored the optimized Hadoop distribution from Hanborq, which uses a worker pool for faster job launch/setup. Is there an equivalent available elsewhere? The Hanborq distribution is not very active: it was updated 4 months ago and is built on an older version of Hadoop.

Some of my settings are as follows. mapred-site.xml:

<property>
    <name>mapred.child.java.opts</name>
    <value>-Xms1g</value>
</property>
<property>
    <name>mapred.sort.avoidance</name>
    <value>true</value>
</property>
<property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>-1</value>
</property>
<property>
    <name>mapreduce.tasktracker.outofband.heartbeat</name>
    <value>true</value>
</property>
<property>
    <name>mapred.compress.map.output</name>
    <value>false</value>
</property>

core-site.xml:

<property>
    <name>io.sort.mb</name>
    <value>300</value>
</property>
<property>
    <name>io.sort.factor</name>
    <value>100</value>
</property>

Any help would be greatly appreciated. Thanks in advance.

Upvotes: 1

Views: 2024

Answers (1)

Kun Ling

Reputation: 2219

Part of the latency is caused by the heartbeat mechanism. The TaskTrackers (TTs) send heartbeats to the JobTracker (JT) to let it know they're alive, and as part of each heartbeat they also announce how many open map and reduce slots they have. In response, the JT assigns work for the TT to perform. This means that when you submit a job, TTs only get tasks as fast as they heartbeat (every 2 - 4 seconds, give or take). Additionally, the JT (by default) assigns only a single task during each heartbeat. So if you have only a single TT, you can assign only 1 task every 2 - 4 seconds, even if the TT has additional capacity.
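
As a rough illustration of the point above (a back-of-the-envelope estimate, not a measurement): assume a single TT, one task assigned per heartbeat, and a heartbeat interval of 3 seconds, the midpoint of the 2 - 4 second range. The symbols N (number of tasks), h (heartbeat interval) and T_assign (time until every task has been handed out) are introduced here only for this estimate.

```latex
T_{\text{assign}} \approx N \cdot h
\qquad \text{e.g. } N = 4,\ h = 3\,\text{s} \;\Rightarrow\; T_{\text{assign}} \approx 12\,\text{s}
```

Under these assumptions, even a tiny job spends several seconds just waiting for its tasks to be assigned, regardless of how little data it processes.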

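So, you can:

  1. Shorten the interval between two heartbeats.

  2. Change how the task scheduler assigns work on each heartbeat from a TaskTracker, for example by setting mapred.fairscheduler.assignmultiple so that the Fair Scheduler can assign more than one task per heartbeat.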
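
A minimal sketch of option 2 in mapred-site.xml, assuming a Hadoop 1.x (JobTracker/TaskTracker) cluster with the Fair Scheduler jar on the JobTracker's classpath. Treat it as a starting point and check the property names against the documentation for your Hadoop version.

```xml
<!-- Assumption: Hadoop 1.x with the contrib Fair Scheduler available -->
<property>
    <!-- Replace the default scheduler with the Fair Scheduler -->
    <name>mapred.jobtracker.taskScheduler</name>
    <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
    <!-- Let the scheduler hand out more than one task per heartbeat -->
    <name>mapred.fairscheduler.assignmultiple</name>
    <value>true</value>
</property>
```

The JobTracker has to be restarted for a scheduler change to take effect.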

Upvotes: 1
