Reputation: 1048
I have a 3 master, 5 slave Mesos setup. The servers can communicate just fine: a master is elected and the slaves connect smoothly. But any slave that is idle, i.e. has no application running on it, first gets a "health check failed" on the master (the slave itself does not complain about anything or lose its connection, as far as I can tell), and some time later the master reports "status update from unknown slave" and terminates the slave. This happens to all idle slaves, while those with running processes keep working without an issue.
Does anyone know how to fix this?
Attached is an excerpt of the slave's log; I tried to clean it up a little.
I0225 18:02:14.077440 9029 slave.cpp:3053] Current usage 60.93%. Max allowed age: 2.035008507120139days
I0225 18:02:28.615249 9025 slave.cpp:2088] Handling status update TASK_KILLED (UUID: id) for task develop.id of framework fwid from executor(1)@ip1:45193
W0225 18:02:28.615352 9025 slave.cpp:2121] Could not find the executor for status update TASK_KILLED (UUID: id) for task develop.id of framework fwid
I0225 18:02:28.615947 9031 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: id) for task develop.id of framework fwid
I0225 18:02:28.616165 9031 status_update_manager.cpp:373] Forwarding status update TASK_KILLED (UUID: id) for task develop.id of framework fwid to master@ip2:5050
I0225 18:02:28.616334 9031 slave.cpp:2252] Sending acknowledgement for status update TASK_KILLED (UUID: id) for task develop.id of framework fwid to executor(1)@ip1:45193
I0225 18:02:28.618074 9025 slave.cpp:508] Slave asked to shut down by master@ip2:5050 because 'Status update from unknown slave'
I0225 18:02:28.618239 9025 slave.cpp:1406] Asked to shut down framework fwid by master@ip2:5050
I0225 18:02:28.618273 9025 slave.cpp:1431] Shutting down framework fwid
I0225 18:02:28.618387 9025 slave.cpp:2878] Shutting down executor 'develop.id' of framework fwid
I0225 18:02:29.336168 9027 slave.cpp:2088] Handling status update TASK_KILLED (UUID: id) for task develop.id of framework fwid from executor(1)@ip1:42376
W0225 18:02:29.336278 9027 slave.cpp:2112] Ignoring status update TASK_KILLED (UUID: id) for task develop.id of framework fwid for terminating framework fwid
I0225 18:02:30.338100 9030 containerizer.cpp:997] Executor for container 'id' has exited
I0225 18:02:30.338213 9030 containerizer.cpp:882] Destroying container 'id'
I0225 18:02:30.343300 9025 slave.cpp:2596] Executor 'develop.id' of framework fwid exited with status 0
I0225 18:02:30.343474 9025 slave.cpp:2732] Cleaning up executor 'develop.id' of framework fwid
I0225 18:02:30.343935 9029 gc.cpp:56] Scheduling '/mnt/spark/mesos/slaves/S12/frameworks/fwid/executors/develop.id/runs/id' for gc 6.99999602148148days in the future
I0225 18:02:30.344023 9025 slave.cpp:2807] Cleaning up framework fwid
I0225 18:02:30.344100 9029 gc.cpp:56] Scheduling '/mnt/spark/mesos/slaves/S12/frameworks/fwid/executors/develop.id' for gc 6.9999960201037days in the future
I0225 18:02:30.344174 9029 gc.cpp:56] Scheduling '/mnt/spark/mesos/meta/slaves/S12/frameworks/fwid/executors/develop.id/runs/id' for gc 6.99999601960593days in the future
I0225 18:02:30.344216 9025 slave.cpp:466] Slave terminating
Upvotes: 2
Views: 885
Reputation: 4322
The "health check failed" message means that the master was unable to PING the slave (or at least didn't receive its PONGs) within the past minute and a half. Do you have intermittent network issues? Did you try pinging the slave from the master (and v.v.)? Are there any firewall issues on the slave for port 5051 (or whichever port you used)?
Upvotes: 1