Reputation: 175
What happens when the Resource Manager (RM) goes down in YARN?
If the Resource Manager goes down in the middle of a running job, what happens to the job?
Does the job get resubmitted automatically, or do we need to submit it again?
Thanks,
Venkat
Upvotes: 4
Views: 6500
Reputation: 1
Upvotes: 0
Reputation: 38910
The Apache Hadoop documentation explains ResourceManager (RM) high availability as follows.
ResourceManager HA is realized through an Active/Standby architecture.
At any point in time, one of the RMs is Active, and the other, Standby node is waiting to take over if the Active RM fails.
The RM being promoted to an active state loads the RM internal state from State-store and continues to operate from where the previous active left off.
A new attempt is spawned for each managed application previously submitted to the RM. Applications can checkpoint periodically to avoid losing any work.
The State-store must be visible from both the Active and Standby RMs. Currently, there are two RMStateStore implementations for persistence: FileSystemRMStateStore and ZKRMStateStore.
The ZKRMStateStore (ZooKeeper) implicitly allows write access to a single RM at any point in time, and hence is the recommended store to use in an HA cluster.
With the ZKRMStateStore, there is no need for a separate fencing mechanism to address a potential split-brain situation in which multiple RMs assume the Active role. ZooKeeper handles this situation well.
ZooKeeper is not used only for Resource Manager failover; many applications use it nowadays. Another failover use case in Hadoop is the NameNode, whose failover also happens through ZooKeeper. Have a look at the NameNode failover process too.
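The HA setup described above is driven by properties in `yarn-site.xml`. As a rough sketch only, the Python snippet below renders a minimal set of those properties. The property names are the ones used by Hadoop 2.x; the hostnames, the cluster id, the ZooKeeper quorum and the helper `render_yarn_site` are placeholders I made up for illustration, so adjust them to your cluster and check the names against the documentation for your Hadoop version.

```python
import xml.etree.ElementTree as ET

# Placeholder values: the hostnames, cluster id and ZooKeeper quorum
# are assumptions for illustration, not values from this answer.
HA_PROPERTIES = {
    "yarn.resourcemanager.ha.enabled": "true",
    "yarn.resourcemanager.cluster-id": "example-cluster",
    "yarn.resourcemanager.ha.rm-ids": "rm1,rm2",
    "yarn.resourcemanager.hostname.rm1": "master1.example.com",
    "yarn.resourcemanager.hostname.rm2": "master2.example.com",
    # Persist RM state so the newly active RM can load it.
    "yarn.resourcemanager.recovery.enabled": "true",
    "yarn.resourcemanager.store.class":
        "org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore",
    "yarn.resourcemanager.zk-address":
        "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181",
}


def render_yarn_site(properties):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_yarn_site(HA_PROPERTIES))
```

The same properties can of course be written by hand; the script is only a compact way to show which settings belong together.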
Hadoop 2.x before 2.6.0:
When a ResourceManager dies and is restarted, or fails over to another ResourceManager in the case of an HA cluster, the newly active ResourceManager instructs running ApplicationMasters to abort. This uses up an application attempt.
Also, if the ResourceManager is down for some time and the ApplicationMaster is unable to connect, it will time out and abort. That uses up an application attempt too.
When a new ResourceManager becomes active, it can recover applications with failed attempts that have not exceeded their max-attempts.
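To make the attempt accounting concrete, here is a toy model of the behaviour described above. It is not YARN code: the `App` class and `on_rm_restart` function are hypothetical, and the default of 2 mirrors the default of `yarn.resourcemanager.am.max-attempts`.

```python
from dataclasses import dataclass


@dataclass
class App:
    app_id: str
    failed_attempts: int = 0


def on_rm_restart(apps, max_attempts=2):
    """Toy model: return ids of apps that get a new attempt after an RM restart.

    Each running attempt is aborted, which uses up one attempt. An app is
    recovered only if it still has attempts left afterwards.
    """
    recovered = []
    for app in apps:
        app.failed_attempts += 1  # the running ApplicationMaster is told to abort
        if app.failed_attempts < max_attempts:
            recovered.append(app.app_id)
    return recovered


if __name__ == "__main__":
    apps = [App("app_a"), App("app_b", failed_attempts=1)]
    # app_a still has an attempt left; app_b has now used both of its attempts.
    print(on_rm_restart(apps))
```

So in this model a job that has already failed once before the restart is not retried automatically and would have to be resubmitted.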
Have a look at this article for more details
From Hadoop 2.6.0:
The Resource Manager recovers its running state by taking advantage of the container statuses sent from all Node Managers. A Node Manager does not kill its containers when it re-syncs with the restarted Resource Manager.
It continues managing the containers and sends the container statuses to the Resource Manager when it re-registers.
The Resource Manager reconstructs the container instances and the associated applications’ scheduling status by absorbing this container information.
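The idea of rebuilding state from Node Manager reports can be sketched as follows. This is only an illustration, not the actual scheduler recovery code: `application_id_of` and `rebuild_state` are hypothetical helpers, and real container reports carry much more than an id (state, resources, priority and so on). In Hadoop 2.6+ this mode is switched on with `yarn.resourcemanager.work-preserving-recovery.enabled`.

```python
from collections import defaultdict


def application_id_of(container_id):
    """Derive the application id from a container id.

    Assumed layout:
    container_[e<epoch>_]<clusterTimestamp>_<appSeq>_<attempt>_<containerSeq>
    """
    parts = container_id.split("_")
    if parts[1].startswith("e"):  # optional epoch field added after an RM restart
        parts = [parts[0]] + parts[2:]
    return f"application_{parts[1]}_{parts[2]}"


def rebuild_state(node_reports):
    """Group the containers reported by each Node Manager by application."""
    apps = defaultdict(list)
    for node, container_ids in node_reports.items():
        for container_id in container_ids:
            apps[application_id_of(container_id)].append((node, container_id))
    return dict(apps)


if __name__ == "__main__":
    # Made-up reports from two Node Managers re-registering after a restart.
    reports = {
        "nm1": ["container_1410901177871_0001_01_000001",
                "container_e17_1410901177871_0002_01_000001"],
        "nm2": ["container_1410901177871_0001_01_000002"],
    }
    for app_id, containers in rebuild_state(reports).items():
        print(app_id, containers)
```

Because the containers keep running and are simply re-attached to their applications, the job does not lose work and does not need to be resubmitted.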
Upvotes: 5
Reputation: 86
The admin will create a new Resource Manager. It will take the latest information from all the application managers and update the persistent storage that the new Resource Manager will use. It is purely an admin task.
Upvotes: 0