Shashwath

Reputation: 452

My Yarn Map-Reduce Job is taking a lot of time

Input File size : 75GB

Number of Mappers : 2273

Number of reducers : 1 (As shown in the web UI)

Number of splits : 2273

Number of Input files : 867

Cluster : Apache Hadoop 2.4.0

5-node cluster, 1 TB each.

1 master and 4 Datanodes.

It's been 4 hours now and still only 12% of the map phase has completed. Given my cluster configuration, does this make sense, or is there something wrong with the configuration?

Yarn-site.xml

    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8040</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>master</value>
        <description>The hostname of the RM.</description>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
        <description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>8192</value>
        <description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-vcores</name>
        <value>1</value>
        <description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>32</value>
        <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
        <description>Physical memory, in MB, to be made available to running containers.</description>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value>
        <description>Number of CPU cores that can be allocated for containers.</description>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>4</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
        <description>Whether virtual memory limits will be enforced for containers.</description>
    </property>

It is a Map-Reduce job in which I am using multiple outputs, so the reducer will emit multiple files. Each machine has 15 GB RAM. 8 containers are running, and the total memory shown in the RM Web UI is 32 GB.

Any guidance is appreciated. Thanks in advance.

Upvotes: 2

Views: 1002

Answers (1)

Marco99

Reputation: 1659

A few points to check:

  1. The block & split size seem very small considering the data you shared. Try increasing both to an optimal level (see the configuration sketch after this list).

  2. If you are not already using one, add a custom partitioner that spreads your data uniformly across reducers.

  3. Consider using a combiner.

  4. Consider using appropriate compression for the intermediate (map) output.

  5. Use an optimal block replication factor.

  6. Increase the number of reducers as appropriate.

These will help increase performance. Give it a try and share your findings!
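As a rough illustration of points 1, 4 and 6, here is a minimal configuration sketch (set per job or in mapred-site.xml). The property names are standard Hadoop 2.x settings, but the values are only assumed starting points, and the Snappy codec requires the native library to be installed; tune everything to your own data and cluster:

    <!-- Example only: raise the minimum split size so each mapper
         processes more data (~256 MB here instead of one small split). -->
    <property>
        <name>mapreduce.input.fileinputformat.split.minsize</name>
        <value>268435456</value>
    </property>
    <!-- Compress intermediate map output to reduce shuffle I/O. -->
    <property>
        <name>mapreduce.map.output.compress</name>
        <value>true</value>
    </property>
    <property>
        <name>mapreduce.map.output.compress.codec</name>
        <value>org.apache.hadoop.io.compress.SnappyCodec</value>
    </property>
    <!-- Example reducer count; size it to your data and cluster. -->
    <property>
        <name>mapreduce.job.reduces</name>
        <value>8</value>
    </property>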

Edit 1: Try to compare the log generated by a successful map task with that of a long-running map task attempt (12% means roughly 272 of the 2273 map tasks have completed). You will get to know where it is getting stuck.

Edit 2: Tweak these parameters: yarn.scheduler.minimum-allocation-mb, yarn.scheduler.maximum-allocation-mb, yarn.nodemanager.resource.memory-mb, mapreduce.map.memory.mb, mapreduce.map.java.opts, mapreduce.reduce.memory.mb, mapreduce.reduce.java.opts, mapreduce.task.io.sort.mb, mapreduce.task.io.sort.factor

These will improve the situation. Take a trial-and-error approach.
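As one possible starting point on 15 GB nodes, the sketch below assumes roughly 2 GB map containers, 4 GB reduce containers, and heaps at about 80% of the container size; these are example values only, not recommended settings, and must be adjusted for your workload:

    <!-- yarn-site.xml (example values only) -->
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>12288</value> <!-- leave part of the 15 GB for OS and daemons -->
    </property>

    <!-- mapred-site.xml (example values only) -->
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>2048</value>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx1638m</value> <!-- ~80% of the map container size -->
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx3276m</value> <!-- ~80% of the reduce container size -->
    </property>
    <property>
        <name>mapreduce.task.io.sort.mb</name>
        <value>512</value>
    </property>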

Also refer: Container is running beyond memory limits

Edit 3: Try to isolate a part of the logic, convert it to a Pig script, execute it, and see how it behaves.

Upvotes: 1
