Novice

Reputation: 155

YARN: maximum parallel Map task count

The following is mentioned in the Hadoop Definitive Guide:

"What qualifies as a small job? By default one that has less than 10 mappers, only one reducer, and the input size is less than the size of one HDFS block. "

But how does it count the number of mappers in a job before executing it on YARN? In MR1, the number of mappers depends on the number of input splits. Does the same apply to YARN as well? In YARN, containers are flexible. So is there any way to compute the maximum number of map tasks that can run on a given cluster in parallel (some kind of tight upper bound, because it would give me a rough idea of how much data I can process in parallel)?

Upvotes: 2

Views: 7115

Answers (2)

Novice

Reputation: 155

mapreduce.job.maps = MIN(yarn.nodemanager.resource.memory-mb / mapreduce.map.memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.map.cpu.vcores, number of physical drives x workload factor) x number of worker nodes

mapreduce.job.reduces = MIN(yarn.nodemanager.resource.memory-mb / mapreduce.reduce.memory.mb, yarn.nodemanager.resource.cpu-vcores / mapreduce.reduce.cpu.vcores, number of physical drives x workload factor) x number of worker nodes

The workload factor can be set to 2.0 for most workloads. Consider a higher setting for CPU-bound workloads.
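To make the formula concrete, here is a small Python sketch with hypothetical cluster numbers (10 worker nodes, 48 GB of container memory and 16 vcores per node, 6 data drives); all values are illustrative, not recommendations:

    # Illustrative sketch of the formula above; every number is a
    # hypothetical example value, not a recommendation.
    nodes = 10                 # worker nodes in the cluster
    nm_memory_mb = 48 * 1024   # yarn.nodemanager.resource.memory-mb per node
    nm_vcores = 16             # yarn.nodemanager.resource.cpu-vcores per node
    drives = 6                 # physical drives per node used by the DataNode
    workload_factor = 2.0      # 2.0 for most workloads, higher for CPU-bound

    map_memory_mb = 2048       # mapreduce.map.memory.mb
    map_vcores = 1             # mapreduce.map.cpu.vcores
    reduce_memory_mb = 4096    # mapreduce.reduce.memory.mb
    reduce_vcores = 1          # mapreduce.reduce.cpu.vcores

    max_maps = int(min(nm_memory_mb / map_memory_mb,
                       nm_vcores / map_vcores,
                       drives * workload_factor)) * nodes
    max_reduces = int(min(nm_memory_mb / reduce_memory_mb,
                          nm_vcores / reduce_vcores,
                          drives * workload_factor)) * nodes

    print(max_maps)     # 120: the spindle term (6 x 2.0 = 12 per node) is the bottleneck
    print(max_reduces)  # 120

In this example the drive count, not memory or vcores, caps concurrency on each node.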

yarn.nodemanager.resource.memory-mb (memory available on a node for containers) = total system memory – reserved memory (10–20% of memory for Linux and its daemon services) – memory for task buffers, such as the HDFS sort I/O buffer – memory allocated to other Hadoop daemons, such as the HDFS DataNode (default 1024 MB), the NodeManager, the RegionServer, etc.
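A back-of-the-envelope version of that subtraction, with hypothetical numbers for a 64 GB node:

    # Hypothetical 64 GB node; all reservations below are example values.
    total_mb = 64 * 1024
    os_reserved_mb = int(total_mb * 0.15)  # 10-20% for Linux and daemon services
    datanode_mb = 1024                     # HDFS DataNode heap (default 1024 MB)
    nodemanager_mb = 1024                  # YARN NodeManager heap
    other_mb = 1024                        # e.g. RegionServer, task buffers

    nm_resource_memory_mb = (total_mb - os_reserved_mb - datanode_mb
                             - nodemanager_mb - other_mb)
    print(nm_resource_memory_mb)  # 52634 MB left for containers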

Hadoop is a disk I/O-centric platform by design. The number of independent physical drives (“spindles”) dedicated to DataNode use limits how much concurrent processing a node can sustain. As a result, the number of vcores allocated to the NodeManager should be the lesser of:

[(total vcores) – (number of vcores reserved for non-YARN use)] or [2 x (number of physical disks used for DataNode storage)]

So

yarn.nodemanager.resource.cpu-vcores = min((total vcores) – (number of vcores reserved for non-YARN use), 2 x (number of physical disks used for DataNode storage))

Available vcores on a node for containers = total number of vcores – vcores for the operating system (for estimating vcore demand, use the number of concurrent processes or tasks each service runs as an initial guide; for the OS we take 2) – YARN NodeManager (default is 1) – HDFS DataNode (default is 1).
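Both vcore rules in one short sketch, again with hypothetical node specs:

    # Hypothetical node: 24 vcores, 6 DataNode disks; reservations are examples.
    total_vcores = 24
    os_vcores = 2           # operating system (the initial guide above)
    nodemanager_vcores = 1  # YARN NodeManager default
    datanode_vcores = 1     # HDFS DataNode default

    non_yarn = os_vcores + nodemanager_vcores + datanode_vcores
    datanode_disks = 6

    nm_resource_cpu_vcores = min(total_vcores - non_yarn, 2 * datanode_disks)
    print(nm_resource_cpu_vcores)  # 12: the spindle rule wins over the 20 free vcores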

Note:

mapreduce.map.memory.mb is the combination of mapreduce.map.java.opts.max.heap plus some headroom (a safety margin).

The settings for mapreduce.[map | reduce].java.opts.max.heap specify the default memory allotted for mapper and reducer heap size, respectively. The mapreduce.[map | reduce].memory.mb settings specify the memory allotted to their containers, and the value assigned should allow overhead beyond the task heap size. Cloudera recommends applying a factor of 1.2 to the mapreduce.[map | reduce].java.opts.max.heap setting. The optimal value depends on the actual tasks. Cloudera also recommends setting mapreduce.map.memory.mb to 1–2 GB and setting mapreduce.reduce.memory.mb to twice the mapper value. The ApplicationMaster heap size is 1 GB by default, and can be increased if your jobs contain many concurrent tasks.
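As an illustration of that 1.2 factor, with hypothetical heap sizes:

    # Hypothetical heap sizes; the 1.2 overhead factor is the guideline above.
    import math

    map_heap_mb = 1536
    reduce_heap_mb = 3072  # twice the mapper heap

    # Container memory = task heap plus ~20% headroom, rounded up.
    map_container_mb = math.ceil(map_heap_mb * 1.2)        # 1844 MB, within 1-2 GB
    reduce_container_mb = math.ceil(reduce_heap_mb * 1.2)  # 3687 MB

    print(map_container_mb, reduce_container_mb)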



Upvotes: 0

Ashrith

Reputation: 6855

But how does it count the number of mappers in a job before executing it on YARN? In MR1, the number of mappers depends on the number of input splits. Does the same apply to YARN as well?

Yes, in YARN as well: if you are using MapReduce-based frameworks, the number of mappers depends on the input splits.
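As a rough sketch of how that split count comes about for a single splittable file (mirroring FileInputFormat's split-size rule; the concrete sizes are example values):

    # FileInputFormat's split-size rule:
    #   split_size = max(min_split_size, min(max_split_size, block_size))
    # All concrete sizes below are hypothetical examples.
    import math

    block_size = 128 * 1024 * 1024  # dfs.blocksize: 128 MB
    min_split_size = 1              # mapreduce.input.fileinputformat.split.minsize
    max_split_size = 2**63 - 1      # mapreduce.input.fileinputformat.split.maxsize

    split_size = max(min_split_size, min(max_split_size, block_size))

    file_size = 1024 * 1024 * 1024  # a hypothetical 1 GB input file
    num_mappers = math.ceil(file_size / split_size)
    print(num_mappers)  # 8 map tasks for a 1 GB file with 128 MB splits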

In YARN, containers are flexible. So is there any way to compute the maximum number of map tasks that can run on a given cluster in parallel (some kind of tight upper bound, because it would give me a rough idea of how much data I can process in parallel)?

The number of map tasks that can run in parallel on the YARN cluster depends on how many containers can be launched and run in parallel on the cluster. This ultimately depends on how you configure MapReduce in the cluster, which is explained clearly in this guide from Cloudera.

Upvotes: 3
