Mohammed Asad

Reputation: 1009

What is the purpose of "uber mode" in hadoop?

Hi, I am a big data newbie. I searched all over the internet to find out what exactly uber mode is, but the more I searched, the more confused I got. Can anybody please help me by answering my question?

Upvotes: 32

Views: 32200

Answers (5)

Salim muneer lala

Reputation: 99

First we need to understand what happens when a job is submitted by the user.

It goes to the Resource Manager.

The Resource Manager coordinates with one of the Node Managers and creates a container on that node.

In this container an Application Master service is started, which will handle this application locally.

This Application Master is now responsible for requesting more resources for the application from the Resource Manager.

  • The Application Master first checks with the Name Node to find out
    where the data blocks are stored, i.e. on which data nodes in the cluster the blocks are present.

  • After getting that information, it requests containers (CPU + memory)
    on those nodes, so that data locality is followed.

  • Every application has a separate application master.

Once the Resource Manager grants access to more resources, containers/executors are created on those nodes. Those containers/executors are then handled by the Node Managers.

Uber Mode:

Sometimes the job is so small that it can run inside the container where the Application Master itself is running. In that case, the Application Master does not need to request separate containers for the tasks.
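
As a rough sketch of how this is requested from a job driver (the class name, job name and input/output paths are just placeholders; the framework still decides on its own whether the job is small enough to actually run uberized):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class UberDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Allow the job to run inside the Application Master's own container
            // instead of asking for separate task containers.
            conf.setBoolean("mapreduce.job.ubertask.enable", true);

            Job job = Job.getInstance(conf, "uber-demo");
            job.setJarByClass(UberDemo.class);
            // ... set mapper, reducer and key/value classes as usual ...
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }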

Upvotes: 0

Azim

Reputation: 1091

Pretty good answers have already been given for "What is Uber Mode?". Just to add some more information on the "Why?".

The application master decides how to run the tasks that make up the MapReduce job. If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges the overhead of allocating and running tasks in new containers outweighs the gain in running them in parallel, when compared to running them sequentially on one node.

Now the question could be raised: "What qualifies as a small job?"

By default, a small job is one that has less than 10 mappers, only one reducer, and an input size that is less than the size of one HDFS block.
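
These defaults map onto the ubertask threshold properties. As a rough sketch (inside a job driver; the values shown are just those documented defaults, so adjust them to taste):

    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.job.ubertask.enable", true);
    // "Small" means: at most this many maps, at most this many reduces,
    // and an input no larger than maxbytes (by default one HDFS block).
    conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
    conf.setLong("mapreduce.job.ubertask.maxbytes", 128L * 1024 * 1024); // e.g. a 128 MB block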

Upvotes: 4

ableHercules

Reputation: 660

What is UBER mode in Hadoop2?

Normally, mappers and reducers run in containers scheduled by the ResourceManager (RM): the RM creates separate containers for the mappers and reducers. The Uber configuration allows the mappers and reducers to run in the same process as the ApplicationMaster (AM).

Uber jobs:

Uber jobs are jobs that are executed within the MapReduce ApplicationMaster itself, rather than communicating with the RM to create the mapper and reducer containers. The AM runs the map and reduce tasks within its own process and avoids the overhead of launching and communicating with remote containers.

Why

If you have a small dataset, or you want to run MapReduce on a small amount of data, the Uber configuration will help you out by reducing the additional time that MapReduce normally spends on the mapper and reducer phases.

Can I configure Uber mode for all MapReduce jobs?

As of now, map-only jobs and jobs with one reducer are supported.
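
If you want to confirm whether a job actually ran uberized, one hedged sketch (assuming a Hadoop 2.x client where org.apache.hadoop.mapreduce.Job exposes isUber()) is to ask the Job object from the driver after it finishes:

    Job job = Job.getInstance(new Configuration(), "maybe-uber");
    job.getConfiguration().setBoolean("mapreduce.job.ubertask.enable", true);
    job.setNumReduceTasks(0);   // map-only jobs qualify for uberization
    // ... set mapper class, input and output paths as usual ...
    job.waitForCompletion(true);
    // Reports whether the framework really ran the tasks inside the AM's process.
    System.out.println("Ran as an uber job? " + job.isUber());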

Upvotes: 49

Navneet Kumar

Reputation: 3752

An Uber job occurs when multiple mappers and reducers are combined to run inside a single container. There are four core settings for configuring Uber jobs in mapred-site.xml:

  • mapreduce.job.ubertask.enable
  • mapreduce.job.ubertask.maxmaps
  • mapreduce.job.ubertask.maxreduces
  • mapreduce.job.ubertask.maxbytes

You can find more details here: http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.15/bk_using-apache-hadoop/content/uber_jobs.html
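
As a small sketch of checking which values a job would actually pick up for these properties (assuming the cluster's mapred-site.xml is on the client's classpath; values come from mapred-default.xml unless overridden there or per job):

    Job job = Job.getInstance();   // loads *-default.xml and *-site.xml from the classpath
    Configuration conf = job.getConfiguration();
    for (String key : new String[] {
            "mapreduce.job.ubertask.enable",
            "mapreduce.job.ubertask.maxmaps",
            "mapreduce.job.ubertask.maxreduces",
            "mapreduce.job.ubertask.maxbytes" }) {
        System.out.println(key + " = " + conf.get(key));
    }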

Upvotes: 11

Shubham Chaurasia

Reputation: 2632

In terms of Hadoop 2.x, Uber jobs are jobs which are launched in the MapReduce ApplicationMaster itself, i.e. no separate containers are created for the map and reduce tasks, so the overhead of creating containers and communicating with them is saved.

As far as the working with Hadoop 1.x and 2.x is concerned, I suppose the difference is only in the terminology of 1.x versus 2.x; there is no difference in how it works.

The configuration params are the same as those mentioned by Navneet Kumar in his answer.
PS: Use it only with small datasets.

Upvotes: 4
