elmalto

Reputation: 1048

mesos slave gets shut down if unused

I have a Mesos setup with 3 masters and 5 slaves. The servers communicate just fine: a master is elected and the slaves register smoothly. However, any slave that sits idle with no application running first triggers "health check failed" on the master (the slave itself does not log any errors or lose its connection, as far as I can tell), and some time later the master reports "status update from unknown slave" and shuts the slave down. This happens to every idle slave, while slaves with running processes keep working without issue.

Does anyone know how to fix this?

Attached is an excerpt of the slave's log, which I cleaned up a little:

I0225 18:02:14.077440  9029 slave.cpp:3053] Current usage 60.93%. Max allowed age: 2.035008507120139days
I0225 18:02:28.615249  9025 slave.cpp:2088] Handling status update TASK_KILLED (UUID: id) for task develop.id of framework fwid from executor(1)@ip1:45193
W0225 18:02:28.615352  9025 slave.cpp:2121] Could not find the executor for status update TASK_KILLED (UUID: id) for task develop.id of framework fwid
I0225 18:02:28.615947  9031 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: id) for task develop.id of framework fwid
I0225 18:02:28.616165  9031 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: id) for task develop.id of framework fwid to master@ip2:5050
I0225 18:02:28.616334  9031 slave.cpp:2252] Sending acknowledgement for status update TASK_KILLED (UUID: id) for task develop.id of framework fwid to executor(1)@ip1:45193
I0225 18:02:28.618074  9025 slave.cpp:508] Slave asked to shut down by master@ip2:5050 because 'Status update from unknown slave'
I0225 18:02:28.618239  9025 slave.cpp:1406] Asked to shut down framework fwid by master@ip2:5050
I0225 18:02:28.618273  9025 slave.cpp:1431] Shutting down framework fwid
I0225 18:02:28.618387  9025 slave.cpp:2878] Shutting down executor 'develop.id' of framework fwid
I0225 18:02:29.336168  9027 slave.cpp:2088] Handling status update TASK_KILLED (UUID: id) for task develop.id of framework fwid from executor(1)@ip1:42376
W0225 18:02:29.336278  9027 slave.cpp:2112] Ignoring status update TASK_KILLED (UUID: id) for task develop.id of framework fwid for terminating framework fwid
I0225 18:02:30.338100  9030 containerizer.cpp:997] Executor for container 'id' has exited
I0225 18:02:30.338213  9030 containerizer.cpp:882] Destroying container 'id'
I0225 18:02:30.343300  9025 slave.cpp:2596] Executor 'develop.id' of framework fwid exited with status 0
I0225 18:02:30.343474  9025 slave.cpp:2732] Cleaning up executor 'develop.id' of framework fwid
I0225 18:02:30.343935  9029 gc.cpp:56] Scheduling '/mnt/spark/mesos/slaves/S12/frameworks/fwid/executors/develop.id/runs/id' for gc 6.99999602148148days in the future
I0225 18:02:30.344023  9025 slave.cpp:2807] Cleaning up framework fwid
I0225 18:02:30.344100  9029 gc.cpp:56] Scheduling '/mnt/spark/mesos/slaves/S12/frameworks/fwid/executors/develop.id' for gc 6.9999960201037days in the future
I0225 18:02:30.344174  9029 gc.cpp:56] Scheduling '/mnt/spark/mesos/meta/slaves/S12/frameworks/fwid/executors/develop.id/runs/id' for gc 6.99999601960593days in the future
I0225 18:02:30.344216  9025 slave.cpp:466] Slave terminating

Upvotes: 2

Views: 885

Answers (1)

Adam

Reputation: 4322

The "health check failed" message means that the master was unable to PING the slave (or at least didn't receive its PONGs) within the past minute and a half. Do you have intermittent network issues? Did you try pinging the slave from the master (and v.v.)? Are there any firewall issues on the slave for port 5051 (or whichever port you used)?

Upvotes: 1
