Cannot Start Mesos/Marathon Cluster

Question

Physical Machine: 192.168.10.1 ( Mesos, Zookeeper, Marathon )
Virtual Machine: 192.168.122.10 ( Mesos, Zookeeper )
Virtual Machine: 192.168.122.46 ( Mesos, Zookeeper )

All three machines run Fedora 23 Server.

The two networks are already inter-routed by default as the virtual machines all reside on the physical machine.

There is no firewall setup.

Mesos Election LOG:

Master bound to loopback interface! Cannot communicate with remote schedulers or slaves. You might want to set '--ip' flag to a routable IP address.

I can set this manually, but I cannot set it dynamically: the --ip_discovery_command flag is not recognized.

What I wanted to do was link the below script to that flag.

if [[ $(ip addr) == *enp8s0* ]]; then
    # "inet " (with the trailing space) skips the inet6 lines
    ip addr show enp8s0 | awk -F'/| ' '/inet / { print $6 }'
else
    ip addr show eth0 | awk -F'/| ' '/inet / { print $6 }'
fi

When I do set this manually (not what I want to do), the Mesos page at IP:5050 comes up, but then mesos-master fails after 1 minute with this:

F0427 17:03:27.975260  6914 master.cpp:1253] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
*** Check failure stack trace: ***
    @     0x7f8360fa9edd  (unknown)
    @     0x7f8360fabc50  (unknown)
    @     0x7f8360fa9ad3  (unknown)
    @     0x7f8360fac61e  (unknown)
    @     0x7f83619a85dd  (unknown)
    @     0x7f83619e7c30  (unknown)
    @     0x55a885ee3b2e  (unknown)
    @     0x7f8361a11c0e  (unknown)
    @     0x7f8361a5d75e  (unknown)
    @     0x7f8361a7077a  (unknown)
    @     0x7f83618f4aae  (unknown)
    @     0x7f8361a70768  (unknown)
    @     0x7f8361a548d0  (unknown)
    @     0x7f8361fc832c  (unknown)
    @     0x7f8361fd42a5  (unknown)
    @     0x7f8361fd472f  (unknown)
    @     0x7f8360a5e60a  start_thread
    @     0x7f835fefda4d  __clone
Aborted (core dumped)

ZooKeeper is set up like this:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/var/lib/zookeeper/data
dataLogDir=/var/lib/zookeeper/log
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1
server.1:192.168.10.1:2888:3888
server.2:192.168.122.46:2888:3888
server.3:192.168.122.10:2888:3888

and I have no idea how to verify that it is working properly.

I'm honestly at the end of my rope. I have been pulling my hair out over this for the past week because of poor documentation and missing architecture explanations (primarily Marathon), horribly organized logs (Mesos), systemd being unable to run a bash script and use its output as a variable, and a lack of instructions all around.

Am I doing something wrong? I appreciate any assistance I can get. Let me know if you need anything I have not yet provided and I will post it right away.

EDIT:

I fixed the issue with Marathon by adding two additional Marathon servers on the VMs so that they could form a quorum.

EDIT2:

I am now having an issue where the Mesos master keeps rapidly re-electing a leader, but depending on the outcome I will look into this later.

Tobi · Accepted Answer

If you follow the installation docs closely, I think you should get it to work.

For example, your "Master bound to loopback" problem is IMHO related to incorrect or incomplete settings. See:

Hostname (optional)

If you're unable to resolve the hostname of the machine directly (e.g., if on a different network or using a VPN), set /etc/mesos-master/hostname to a value that you can resolve, for example, an externally accessible IP address or DNS hostname. This will ensure all links from the Mesos console work correctly.

You will also want to set this property in /etc/marathon/conf/hostname.

Furthermore, I'd recommend also setting the master IP address in the /etc/mesos-master/ip file. Always make sure that the hostnames resolve to a non-local IP address, e.g. by adding entries to the /etc/hosts file on each host.

Basically, the /etc/hosts file should look similar to this (replace the hostnames with the actual ones):

127.0.0.1 localhost

192.168.10.1 host1
192.168.122.10 host2
192.168.122.46 host3
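
For the dynamic case from the question, one option is to write the detected address into /etc/mesos-master/ip before the master starts, instead of passing a discovery flag. The following is only a sketch: the function names are hypothetical, the interface names enp8s0 and eth0 are taken from the question, and it assumes the packaged init wrapper reads /etc/mesos-master/ip as described above.

```shell
# Print the first IPv4 address found in `ip -4 -o addr show` output (read from stdin).
# In that one-line format, field 3 is "inet" and field 4 is "ADDRESS/PREFIX".
first_ipv4() {
    awk '$3 == "inet" { split($4, a, "/"); print a[1]; exit }'
}

# Succeed if the given address lies in 127.0.0.0/8.
is_loopback() {
    case "$1" in
        127.*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Try the interface names from the question in order and print the
# first non-loopback IPv4 address; fail if none is found.
detect_master_ip() {
    for dev in enp8s0 eth0; do
        addr=$(ip -4 -o addr show dev "$dev" 2>/dev/null | first_ipv4)
        if [ -n "$addr" ] && ! is_loopback "$addr"; then
            printf '%s\n' "$addr"
            return 0
        fi
    done
    return 1
}

# Usage (as root, before starting mesos-master):
#   addr=$(detect_master_ip) && printf '%s\n' "$addr" > /etc/mesos-master/ip
```

Writing the file only when detection succeeds avoids leaving an empty ip file behind, which would otherwise be passed on as an empty flag value.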

If you just want to test a Mesos cluster, you could also use a preconfigured Vagrant solution like tobilg/coreos-mesos-cluster.

Regarding the ZooKeeper setup, make sure that you created a myid file in the ZooKeeper dataDir on each node (/var/lib/zookeeper/data/myid with the configuration from the question) which contains the numeric id you set for that node, e.g. for 192.168.10.1 the sole content of the file needs to be 1.
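
The id for each node can be read back from the configuration rather than typed by hand. This is a rough sketch with a hypothetical helper name; it assumes the server entries use the form documented in the ZooKeeper administrator guide, server.N=host:port:port, with an equals sign after the id. The listing in the question shows a colon there, which is worth double-checking, since the sketch (and, as far as I know, ZooKeeper itself) will not pick up such lines as server entries.

```shell
# Print the id N of the "server.N=<ip>:<port>:<port>" line whose host
# matches the given IP. Reads zoo.cfg-style text from stdin and prints
# nothing if no line matches.
myid_for_ip() {
    awk -v ip="$1" '
        /^server\.[0-9]+=/ {
            split($0, kv, "=")        # kv[1] = "server.N", kv[2] = "ip:port:port"
            split(kv[2], addr, ":")   # addr[1] = ip
            if (addr[1] == ip) {
                sub(/^server\./, "", kv[1])
                print kv[1]
                exit
            }
        }'
}

# Usage on 192.168.122.46 (paths are assumptions based on the dataDir in the question):
#   myid_for_ip 192.168.122.46 < /etc/zookeeper/zoo.cfg > /var/lib/zookeeper/data/myid
```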

Before debugging the masters, check that the ZooKeeper cluster works correctly, and that a leader is elected. Make sure that /etc/mesos/zk contains the right ZooKeeper connection string on each host, e.g.

zk://192.168.10.1:2181,192.168.122.10:2181,192.168.122.46:2181/mesos
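
To check whether a leader has been elected, each node can be asked for its role with ZooKeeper's srvr four-letter command. A minimal sketch, assuming nc is installed and the four-letter commands are enabled on the servers; the function names are hypothetical, and some nc variants need an extra option to wait for the reply after stdin closes.

```shell
# Split "zk://h1:p1,h2:p2/path" into one "host port" pair per line.
zk_endpoints() {
    printf '%s\n' "$1" | sed -e 's|^zk://||' -e 's|/.*$||' | tr ',' '\n' | tr ':' ' '
}

# Extract the role (leader, follower or standalone) from `srvr` output on stdin.
zk_mode() {
    awk -F': ' '/^Mode:/ { print $2; exit }'
}

# Query every endpoint of the connection string and print "host mode",
# or "host unreachable" if no Mode line came back.
zk_report() {
    zk_endpoints "$1" | while read -r host port; do
        mode=$(printf 'srvr\n' | nc -w 2 "$host" "$port" 2>/dev/null | zk_mode)
        printf '%s %s\n' "$host" "${mode:-unreachable}"
    done
}

# Usage:
#   zk_report "$(cat /etc/mesos/zk)"
```

A healthy three-node ensemble should report one leader and two followers. Nodes that report standalone or are unreachable point to a problem in zoo.cfg, the myid files, or connectivity on ports 2888 and 3888.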

If ZooKeeper works, restart the services and check the masters' logs. Do the same for the slaves.
