Reputation: 452
Input File size : 75GB
Number of Mappers : 2273
Number of reducers : 1 (As shown in the web UI)
Number of splits : 2273
Number of Input files : 867
Cluster : Apache Hadoop 2.4.0
5 nodes cluster, 1TB each.
1 master and 4 Datanodes.
It's been 4 hrs. now and still only 12% of map is completed. Just wanted to know given my cluster configuration does this make sense or is there anything wrong with the configuration?
Yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux- services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.resourcemanager.resource- tracker.address</name>
<value>master:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>master:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>master:8040</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>master</value>
<description>The hostname of the RM.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
<description>Minimum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>8192</value>
<description>Maximum limit of memory to allocate to each container request at the Resource Manager.</description>
</property>
<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
<description>The minimum allocation for every container request at the RM, in terms of virtual CPU cores. Requests lower than this won't take effect, and the specified value will get allocated the minimum.</description>
</property>
<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>32</value>
<description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description>
</property>
<property>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>8192</value>
<description>Physical memory, in MB, to be made available to running containers</description>
</property>
<property>
<name>yarn.nodemanager.resource.cpu-vcores</name>
<value>4</value>
<description>Number of CPU cores that can be allocated for containers.</description>
</property>
<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>4</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
<description>Whether virtual memory limits will be enforced for containers</description>
</property>
Map-Reduce job where I am using multiple outputs. So reducer will emit multiple files. Each machine has 15GB Ram. Containers running are 8. Total memory available is 32GB in RM Web UI.
Any guidance is appreciated. Thanks in advance.
Upvotes: 2
Views: 1002
Reputation: 1659
A few points to check:
The block & split size seems very small considering the data you shared. Try increasing both to an optimal level.
If not used, use a custom partitioner that would uniformly spread your data across reducers.
Consider using combiner.
Consider using appropriate compression (while storing mapper results)
Use optimum number of block replication.
Increase the number of reducers as appropriate.
These will help increase performance. Give a try and share your findings!!
Edit 1: Try to compare the log generated by a successful map task with that of the long running map task attempt. (12% means 272 map tasks completed). You will get to know where it got stuck.
Edit 2: Tweak these parameters: yarn.scheduler.minimum-allocation-mb, yarn.scheduler.maximum-allocation-mb, yarn.nodemanager.resource.memory-mb, mapreduce.map.memory.mb, mapreduce.map.java.opts, mapreduce.reduce.memory.mb, mapreduce.reduce.java.opts, mapreduce.task.io.sort.mb, mapreduce.task.io.sort.factor
These will improve the situation. Take trial and error approach.
Also refer: Container is running beyond memory limits
Edit 3: Try to understand a part of the logic, convert it to pig script, execute and see how it behaves.
Upvotes: 1