Reputation: 1375
I have a small Mesos cluster and I'm using Marathon to manage a set of long-running services with a variable number of instances each.
I'd like to be able to launch new nodes or terminate some of them as required by business needs. However, when terminating a node I noticed a potential problem: when I shut down a Mesos slave, the number of instances of some services can temporarily fall below the defined minimumHealthCapacity. That can lead to downtime if, for example, the machine being stopped is running a service with only one instance.
Consider the following simplified scenario: node 1 is running service A, node 2 is running service B and node 3 is running service C. The minimumHealthCapacity for all services is 1. I want to terminate node 1 and leave only nodes 2 and 3 running, without any downtime on service A. An example of the intended behavior would be to scale service A to 2 instances and then safely terminate node 1.
What can I do to make sure no service falls below the minimumHealthCapacity?
Ideally, I would have a process inspired by rolling updates: replacements are launched on other machines, and only then are the services on the machine to be shut down terminated. At a minimum I would like this to be automated, so that scaling down is a simple script away. I have no requirement on how long it takes, i.e. I can shut down the Mesos slave only after I'm sure the Marathon migration has finished successfully.
Upvotes: 8
Views: 1255
Reputation: 4322
The Mesos dev team is currently working on "Maintenance Primitives" so that an operator can indicate that a particular machine is scheduled to go down at a certain time (or ASAP), triggering messages to each framework notifying them of the intended unavailability window. A framework like Marathon could then decide to migrate its tasks away from that node so that it can be safely terminated without any service downtime.
See https://issues.apache.org/jira/browse/MESOS-1474 for more details/patches.
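Until something like that is available, the manual process described in the question (scale up, wait, then remove the tasks on the node) could be scripted. The following is only a minimal sketch of the planning step, not a tested tool. It assumes the Marathon REST endpoints as I understand them (`PUT /v2/apps/{appId}` to change `instances`, and `DELETE /v2/apps/{appId}/tasks?host=...&scale=true` to kill the tasks on a host and reduce the instance count accordingly), so check them against the Marathon version you run. `plan_drain` and the `WAIT_HEALTHY` marker are hypothetical names introduced here for illustration; the marker stands for "poll the app until the new instances are healthy" and is not an HTTP method.

```python
from collections import Counter
from urllib.parse import quote


def plan_drain(tasks, instances, host):
    """Return the ordered steps needed to move Marathon tasks off `host`.

    tasks:     list of {"appId": ..., "host": ...} dicts, assumed to mirror
               the entries returned by GET /v2/tasks.
    instances: dict mapping appId to its currently configured instance count.
    host:      hostname of the Mesos slave that is going to be shut down.

    Each step is a (method, path, body) tuple. "WAIT_HEALTHY" is a
    hypothetical pseudo-step, not a real HTTP method.
    """
    # Count how many tasks of each app are running on the node to be drained.
    on_host = Counter(t["appId"] for t in tasks if t["host"] == host)

    steps = []
    for app_id, count in sorted(on_host.items()):
        path = "/v2/apps/" + app_id.strip("/")
        # 1. Scale up by the number of tasks that will be lost.
        steps.append(("PUT", path, {"instances": instances[app_id] + count}))
        # 2. Wait until the extra instances are running and healthy.
        steps.append(("WAIT_HEALTHY", path, None))
        # 3. Kill the tasks on the node and scale back down by the same amount.
        kill = path + "/tasks?host=" + quote(host) + "&scale=true"
        steps.append(("DELETE", kill, None))
    return steps
```

For the scenario in the question, `plan_drain(tasks, {"/a": 1, "/b": 1, "/c": 1}, "node1")` yields three steps for service A only: scale to 2, wait, then delete the task on node 1. Note that nothing in this sketch stops Marathon from placing the replacement on the same node, so you would probably also need a placement constraint or some other way to stop the slave from receiving new tasks before running it.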
Upvotes: 1