Reputation: 11
I have been running Marathon, Mesos, and Docker together without trouble, but I recently discovered a problem. When mesos-slave encounters an exception, the task's state in Marathon changes to TASK_LOST, and the task cannot be killed until about 15 minutes have passed.
As a test, I manually rebooted the operating system that runs the mesos-slave service and Docker while a task was running. The task state shown in the Marathon UI became "Unscheduled (100%)", and the task could not be killed, either automatically or manually, until about 15 minutes had passed. My question is: how can I reduce this time? I tried adding these Marathon startup command-line arguments:
task_launch_confirm_timeout=30000
scale_apps_interval=30000
task_lost_expunge_initial_delay=30000
task_launch_timeout = 30000
and this mesos-slave startup command-line argument:
recovery_timeout=1mins
but neither change made any difference.
Upvotes: 0
Views: 397
Reputation: 6371
To change how long executors wait before terminating themselves after the Mesos agent process fails, configure --recovery_timeout on the agent. The flag is documented as:
Amount of time allotted for the agent to recover. If the agent takes longer than recovery_timeout to recover, any executors that are waiting to reconnect to the agent will self-terminate. (default: 15mins)
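For illustration, here is a rough sketch of an agent command line with the timeout lowered to one minute. The master address and work directory are hypothetical placeholders, not values from the question, so substitute your own. Note that on the command line the flag needs the leading `--`. The snippet only prints the command so it can be reviewed before anything is started.

```shell
# Hypothetical placeholders: replace with your own master address and work dir.
MASTER="zk://zk.example.com:2181/mesos"
WORK_DIR="/var/lib/mesos"

# Print the agent command line; --recovery_timeout is the flag quoted above.
agent_cmd() {
  echo mesos-slave \
    --master="$MASTER" \
    --work_dir="$WORK_DIR" \
    --containerizers=docker,mesos \
    --recovery_timeout=1mins
}

# Show the command for review before launching it by hand.
agent_cmd
```

If the agent is started by a service wrapper rather than by hand, the same setting can usually be supplied through the MESOS_RECOVERY_TIMEOUT environment variable instead. Check the running process's arguments to confirm the value was actually picked up.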
Upvotes: 2