kee

Reputation: 11619

High Availability of Resource Manager, Node Manager and Application Master in YARN

From reading the documentation on YARN, I couldn't find any relevant information about HA of the resource manager, node manager and application master. Are they single points of failure? If so, are there any plans to improve this?

Upvotes: 0

Views: 2040

Answers (1)

zillion1

Reputation: 471

A YARN cluster consists of a potentially large number of machines ("nodes"). To be part of the cluster, each node runs at least one service daemon. The type of daemon determines the role the node plays in the cluster.

Almost all nodes run a "node manager" service daemon, which makes them "regular" YARN nodes. The node manager executes part of a YARN job on its own machine, while other parts are executed on other nodes. It only makes sense to run a single node manager on each node. In a 1000-node YARN cluster, there are probably around 999 node managers running. So node managers are indeed redundantly distributed across the cluster. If one node manager fails, others are assigned to take over its tasks.
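To illustrate the "others take over" idea, here is a minimal sketch in Python. It is only a toy model, not YARN code: the function `reassign_tasks`, the node names and the least-loaded placement rule are all assumptions made up for this example.

```python
def reassign_tasks(assignments, failed_node):
    """Return a new assignment with the failed node's tasks moved elsewhere.

    `assignments` maps a (hypothetical) node name to the tasks running there.
    The input dictionary is not modified.
    """
    # Copy the task lists of every node manager that is still alive.
    survivors = {
        node: list(tasks)
        for node, tasks in assignments.items()
        if node != failed_node
    }
    if not survivors:
        raise RuntimeError("no node managers left to take over the tasks")

    for task in assignments.get(failed_node, []):
        # Assumed placement rule: pick the least-loaded survivor,
        # breaking ties by node name so the result is deterministic.
        target = min(survivors, key=lambda node: (len(survivors[node]), node))
        survivors[target].append(task)
    return survivors


cluster = {"node1": ["t1", "t2"], "node2": ["t3"], "node3": []}
after = reassign_tasks(cluster, "node1")
print(after)
```

The point is only that losing one node manager costs some rework, not the job: the tasks that ran on `node1` end up on the remaining nodes.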

Every YARN job is an application of its own, and a dedicated application master daemon is started for the job on one of the nodes. For another application, another application master is started on a different node. The application's actual work is executed on yet other nodes in the cluster. The application master only controls the overall execution of the application. If an application master dies, the whole application has failed, but other applications continue to run. The failed application has to be restarted.

The resource manager daemon runs on one dedicated YARN node. Its only tasks are starting applications (by starting the corresponding application master), collecting information about all nodes in the cluster, and assigning computing resources to applications. The resource manager currently isn't built to be HA, but this normally isn't a problem. If the resource manager dies, all applications need to be restarted.
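To summarize the impact of each kind of failure as described in this answer, here is a small Python sketch. It just encodes the statements above as a lookup; `applications_to_restart`, the component labels and the application names are hypothetical and not part of any YARN API.

```python
def applications_to_restart(failed_component, running_apps, owner_app=None):
    """Return the applications that must be restarted after a failure.

    `failed_component` is one of "node_manager", "application_master" or
    "resource_manager". `owner_app` names the application whose application
    master died (only used for "application_master").
    """
    if failed_component == "node_manager":
        # Other node managers take over the tasks; no application restarts.
        return []
    if failed_component == "application_master":
        # Only the application controlled by that master has failed.
        if owner_app not in running_apps:
            raise ValueError("owner_app must be one of the running applications")
        return [owner_app]
    if failed_component == "resource_manager":
        # Without HA for the resource manager, every application restarts.
        return list(running_apps)
    raise ValueError(f"unknown component: {failed_component}")


apps = ["wordcount", "etl", "report"]
print(applications_to_restart("node_manager", apps))
print(applications_to_restart("application_master", apps, owner_app="etl"))
print(applications_to_restart("resource_manager", apps))
```

So under this description the node manager is not a single point of failure, the application master is one for its own application only, and the resource manager is one for the whole cluster.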

Upvotes: 3
