Reputation: 6381
When I invoke ./stop-yarn.sh and then ./start-yarn.sh, all the ongoing jobs print something like the following:
14/10/22 16:23:28 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:29 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:30 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:31 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:32 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:33 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:34 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:35 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:36 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
14/10/22 16:23:37 INFO ipc.Client: Retrying connect to server: 644v3.mzhen.cn/192.168.7.210:18040. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
or:
14/10/22 16:28:19 ERROR security.UserGroupInformation: PriviledgedActionException as:supertool (auth:SIMPLE) cause:java.io.IOException: org.apache.hadoop.yarn.exceptions.ApplicationNotFoundException: Application with id 'application_1413966215954_0002' doesn't exist in RM.
Is there any way to restart YARN without affecting all the ongoing jobs? Thanks a lot~
Upvotes: 2
Views: 4332
Reputation: 294487
You need a ResourceManager configured for High Availability. Read Deploy ResourceManager HA Cluster for how to configure such a cluster. Then you'll be able to fail over the RM, manually or automatically.
This link explains more: ResourceManager High Availability
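For reference, a minimal yarn-site.xml sketch of what an HA setup might look like; the cluster id, RM ids, hostnames, and ZooKeeper quorum below are placeholders to replace with your own values:

<!-- Turn on ResourceManager HA; all names/addresses below are placeholders -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.cluster-id</name>
  <value>yarn-cluster</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
<!-- One hostname property per RM id declared above -->
<property>
  <name>yarn.resourcemanager.hostname.rm1</name>
  <value>master1.example.com</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname.rm2</name>
  <value>master2.example.com</value>
</property>
<!-- ZooKeeper quorum used for leader election and state storage -->
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
</property>

Once both RMs are up, yarn rmadmin -getServiceState rm1 reports which RM is active, and yarn rmadmin -transitionToActive can drive a manual failover. Automatic failover via the embedded ZooKeeper-based elector is typically on by default when HA is enabled, in which case manual transitions require the --forcemanual flag.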
Actually, from 2.4.0 onward, it is possible to restart the RM and keep accepted applications (MR jobs) without a secondary HA RM. See ResourceManager Restart, quoted below (a configuration sketch follows the quote):
ResourceManager is the central authority that manages resources and schedules applications running atop YARN. Hence, it is potentially a single point of failure in an Apache YARN cluster.
This document gives an overview of ResourceManager Restart, a feature that enhances ResourceManager to keep functioning across restarts and also makes ResourceManager down-time invisible to end-users.
ResourceManager Restart feature is divided into two phases:
Phase 1: Enhance RM to persist application/attempt state and other credentials information in a pluggable state-store. RM will reload this information from state-store upon restart and re-kick the previously running applications. Users are not required to re-submit the applications.
Phase 2: Focus on re-constructing the running state of ResourceManager by reading back the container statuses from NodeManagers and container requests from ApplicationMasters upon restart. The key difference from Phase 1 is that previously running applications will not be killed after the RM restarts, so applications won't lose their work because of an RM outage.
As of Hadoop 2.4.0 release, only ResourceManager Restart Phase 1 is implemented which is described below.
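If you only need the Phase 1 behavior described above (submitted applications are recovered and re-run after an RM restart), a minimal yarn-site.xml sketch might look like the following. The choice of ZKRMStateStore and the ZooKeeper address are assumptions; a FileSystemRMStateStore backed by HDFS is another option:

<!-- Persist application/attempt state so the RM can recover it after restart -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<!-- Pluggable state-store implementation; ZooKeeper-backed here (an assumption) -->
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<!-- ZooKeeper quorum used by the state-store; placeholder address -->
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1.example.com:2181</value>
</property>

Note that with Phase 1 only (i.e. as of 2.4.0), running containers are still killed on restart; applications are re-submitted from the persisted state rather than continuing where they left off.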
Upvotes: 4
Reputation: 7462
Simple answer: No!
When you stop the ResourceManager, the worker daemons (TaskTrackers/NodeManagers and DataNodes) cannot communicate with the master node, and hence with each other, since they don't know where to fetch their input data from. Moreover, the nodes don't know where the data is stored: all of this information is held by the master. While a job is running, all of it is needed in order to proceed, so stopping the ResourceManager makes the running job fail.
EDIT: Apparently, things are not so simple anymore, as @Remus Rusanu's answer proved :)
Upvotes: 1