odedfos

Reputation: 4619

Mesos - how do I keep the executor task running when scheduler disconnects

I'm trying to implement a Mesos framework in which there is a scheduler with custom scheduling logic and long running tasks.

Occasionally the scheduler needs to be restarted due to code deployment.

I noticed that whenever the scheduler disconnects, all the running executors are stopped:

I0202 14:12:48.099814  8539 exec.cpp:383] Executor asked to shutdown

My goals:

  1. I would like for the executor to keep on running during the scheduler restart.

  2. I want the scheduler to detect the active tasks when it comes back up again.

Can I achieve this with Mesos?

Upvotes: 2

Views: 996

Answers (1)

serejja

Reputation: 23851

Yes, both goals are achievable:

  1. Every framework has a setting called the failover timeout, which means "how long to wait before killing executors if the scheduler disconnects". It defaults to 0 (i.e. kill immediately if the scheduler disconnects). To change this, set the failover timeout (the failover_timeout field, in seconds) in the FrameworkInfo you pass during registration (like Mesos Kafka Scheduler).

  2. Mesos has a mechanism called Reconciliation to deal with such cases. In short, when your scheduler fails for some reason, you need to restart it with the same framework id (which means you have to store the framework id somewhere and restore it after failures) and then perform reconciliation.

    During reconciliation Mesos will send you status updates for all known tasks so you can update your scheduler state. Imagine you have a framework with 5 tasks running and your scheduler dies. Before you restart the scheduler, 2 of the tasks also die. After you re-register your scheduler and perform reconciliation, Mesos should send you status updates for all 5 tasks: 3 of them TASK_RUNNING and 2 TASK_LOST for the dead tasks. This way you'll be able to sync with Mesos and regain control of the active tasks.
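
To make the two steps concrete, here is a rough sketch in plain Python. It does not use the Mesos bindings: the dict returned by build_framework_info only stands in for the FrameworkInfo message, and the file path, helper names and task ids are all made up for illustration. It shows persisting the framework id across restarts, setting a non-zero failover timeout, and merging reconciliation status updates into the scheduler's view of its tasks.

```python
import json
import os
import tempfile


def load_framework_id(path):
    """Return the framework id saved by a previous run, or None on first start."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)["framework_id"]


def save_framework_id(path, framework_id):
    """Persist the framework id so a restarted scheduler can reuse it."""
    with open(path, "w") as f:
        json.dump({"framework_id": framework_id}, f)


def build_framework_info(name, failover_timeout_secs, framework_id=None):
    """Build a plain dict standing in for FrameworkInfo (not the real protobuf)."""
    info = {"name": name, "failover_timeout": float(failover_timeout_secs)}
    if framework_id is not None:
        # Reusing the old id is what lets Mesos treat this as a failover.
        info["id"] = {"value": framework_id}
    return info


def apply_status_updates(known_tasks, updates):
    """Merge (task_id, state) updates into the last known task states.

    Returns the merged state and the sorted ids of tasks still running.
    """
    state = dict(known_tasks)
    for task_id, task_state in updates:
        state[task_id] = task_state
    running = sorted(t for t, s in state.items() if s == "TASK_RUNNING")
    return state, running


if __name__ == "__main__":
    # Hypothetical storage location; any durable store would do.
    path = os.path.join(tempfile.mkdtemp(), "framework_id.json")
    save_framework_id(path, "fw-123")  # done once, after the first registration

    # After a restart: reuse the stored id and ask for a one-hour failover window.
    info = build_framework_info("my-framework", 3600, load_framework_id(path))
    print(info)

    # The 5-task example from above: 2 tasks died while the scheduler was down.
    known = {"t%d" % i: "TASK_RUNNING" for i in range(1, 6)}
    updates = [("t1", "TASK_RUNNING"), ("t2", "TASK_RUNNING"), ("t3", "TASK_RUNNING"),
               ("t4", "TASK_LOST"), ("t5", "TASK_LOST")]
    state, running = apply_status_updates(known, updates)
    print(running)
```

In a real scheduler the updates would arrive through the status update callback after you ask the driver to reconcile, and the framework id would typically live somewhere more durable than a local file (ZooKeeper is a common choice, but that is an assumption about your setup).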

Upvotes: 7
