babalu

Reputation: 622

Mesos slave node unable to restart

I set up a Mesos cluster using the CloudFormation templates from Mesosphere. Everything worked fine after the cluster launched.

I recently noticed that none of the slave nodes are listed in the Mesos dashboard. The EC2 console shows the slaves are running and passing health checks. I restarted the nodes in the cluster, but that didn't help. I then SSHed into one of the slaves and noticed that the mesos-slave service was not running. I ran sudo systemctl status dcos-mesos-slave.service, but the service could not be started.

I looked in /var/log/mesos/, ran tail -f on mesos-slave.xxx.invalid-user.log.ERROR.20151127-051324.31267, and saw the following:

F1127 05:13:24.242182 31270 slave.cpp:4079] CHECK_SOME(state::checkpoint(path, bootId.get())): Failed to create temporary file: No space left on device

But the output of df -h shows plenty of disk space left, and free shows plenty of available memory.

So why is it complaining that there is no space left on the device?

Upvotes: 0

Views: 1715

Answers (2)

Dino L.

Reputation: 137

It is good practice to run:

docker rmi -f $(docker images | grep "<none>" | awk "{print \$3}")

This frees disk space by force-deleting untagged Docker images (the ones listed as <none>).

Upvotes: 0

babalu

Reputation: 622

Ok I figured it out.

When Mesos runs for a long time or under frequent load, /tmp can run out of space, because Mesos uses /tmp/mesos/ as its work_dir. The limit here is not bytes but inodes: a filesystem can hold only a fixed number of file references (inodes), so df -h can still report free space while no new file can be created (df -i reports inode usage instead). In my case, the slaves were accumulating a large number of file chunks from image pulls in /var/lib/docker/tmp.

To resolve this issue:

1) Remove files under /tmp

2) Set a different work_dir location
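
For anyone hitting the same error, here is a rough sketch of how inode exhaustion could be checked before cleaning up. It assumes a Linux box with GNU df; the helper names inode_use_pct and count_files are made up for illustration, and the two directories are just the ones mentioned above. Some filesystems report "-" instead of a percentage.

```shell
#!/bin/sh
# Sketch only: assumes Linux with GNU coreutils df.
# Helper names below are hypothetical, not part of Mesos or DC/OS.

# Print the inode usage percentage (without the % sign) for the
# filesystem that holds the given path.
inode_use_pct() {
    # -P forces one line per filesystem; -i reports inodes instead of blocks.
    # Column 5 of that output is IUse%.
    df -Pi "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Count regular files under a directory without crossing into
# other mounted filesystems.
count_files() {
    find "$1" -xdev -type f 2>/dev/null | wc -l | tr -d ' '
}

# Report on the directories named in this answer, skipping any that are absent.
for dir in /tmp /var/lib/docker/tmp; do
    [ -d "$dir" ] || continue
    echo "$dir: $(inode_use_pct "$dir")% inodes used, $(count_files "$dir") files"
done
```

If the inode percentage is close to 100 on the filesystem that holds the work_dir, that would explain the "No space left on device" error even though df -h looks fine. In that case, removing the files or pointing mesos-slave's --work_dir at a filesystem with more free inodes should help.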

Upvotes: 1
