Ben

Reputation: 127

What is the purpose of uber mode in Hadoop?

New to Hadoop here. When a job runs in uber mode, the ApplicationMaster does not request containers from the ResourceManager. Instead, the AM, which runs on a single node, executes the entire job in its own JVM. This is advantageous because it avoids the overhead of negotiating containers with the RM and launching separate task processes.
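For reference, here is a minimal sketch of the configuration properties involved. These are standard MapReduce settings on YARN (Hadoop 2.x); the threshold values shown are only illustrative, and a job qualifies for uber mode only if it fits within them:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UberModeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Ask the framework to consider running this job as an uber task.
        // The job still only qualifies if it fits the thresholds below.
        conf.setBoolean("mapreduce.job.ubertask.enable", true);

        // Illustrative thresholds: the job must need no more maps/reduces
        // than these, and its input must not exceed maxbytes.
        conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
        conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
        conf.setLong("mapreduce.job.ubertask.maxbytes", 128L * 1024 * 1024);

        Job job = Job.getInstance(conf, "small-job");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
```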

What I don't understand: If a job is small enough to be completed in a reasonable amount of time on a single node, what is the point of submitting a MapReduce job in the first place? MapReduce speeds up computation by allowing computation to be performed in parallel across multiple machines. If we only intend to use one node, why not just write a regular program and run it on our local machines?

Upvotes: 3

Views: 2027

Answers (2)

Suresh Vadali

Reputation: 139

One particular scenario I experienced with Apache Crunch: a pipeline consists of a number of MapReduce (MR) jobs spun up by various DoFns (where the core logic is written). Each DoFn results in a map and/or reduce job whose output is generally stored in an immutable distributed object (a PTable or PCollection). Based on the amount of data each of these DoFns processes, the framework decides whether to run each MR job in the pipeline in uber or normal mode. So when you look at the final job counters of the pipeline, it can be a mix of uber and normal MR jobs.
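As a rough illustration (a minimal Crunch sketch; the input/output paths are hypothetical), each stage below may be planned into its own MR job, and Hadoop decides per job whether it runs uber or normal:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchUberExample {
    public static void main(String[] args) {
        Pipeline pipeline = new MRPipeline(CrunchUberExample.class);
        PCollection<String> lines = pipeline.readTextFile("/data/input");

        // A DoFn holding the core logic; the Crunch planner may turn
        // this stage into its own MR job.
        PCollection<String> cleaned = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
                emitter.emit(line.trim().toLowerCase());
            }
        }, Writables.strings());

        pipeline.writeTextFile(cleaned, "/data/output");

        // Each MR job the planner creates is independently eligible for
        // uber mode, depending on how much data that stage processes.
        pipeline.done();
    }
}
```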

Consider another scenario where the same M/R job runs in both incremental and full-load modes. In an incremental load, the logic may be fed a small amount of data that can be processed by a handful of mappers and a single reducer; in a full load of historical data, it may need a much larger number of mappers and reducers. The logic stays the same, but the data volume and the number of InputSplits change. In those cases you don't want to move your processing in and out of the Hadoop cluster depending on data size; you let the framework decide the mode (uber or normal).

Upvotes: 0

Binary Nerd

Reputation: 13937

Perhaps some reasons might be:

  1. You have a reusable process that can scale up if needed, in which case it might start using more containers and no longer run in uber mode.
  2. Keeping things simple. It's unlikely you would write just that one job; typically you will have many, which process varying amounts of data. Why single out one job to process its data using a different method?
  3. A program running outside of MapReduce would likely lose a number of the additional benefits provided by the framework, such as failure recovery.

Upvotes: 1
