aladagemre

Reputation: 602

Mesos-master: Shutdown failed on fd=25: Transport endpoint is not connected [107]

When I run three mesos-master instances with QUORUM=2, each one fails 1 minute after being elected leader, logging errors like these:

E1015 11:50:35.539562 19150 socket.hpp:174] Shutdown failed on fd=25: Transport endpoint is not connected [107]

E1015 11:50:35.539897 19150 socket.hpp:174] Shutdown failed on fd=24: Transport endpoint is not connected [107]

They keep electing one another in a loop, consistently failing and re-electing.

If I set QUORUM=1, everything works well. What could be the reason for this?

Upvotes: 2

Views: 4270

Answers (3)

alecventura

Reputation: 41

We had a similar problem yesterday: Marathon was behaving oddly because some applications were not being deployed. The strange part was that the application came up but its health check never turned green, so nixy wasn't updating nginx.

After a lot of investigation we came to this very same error:

E0718 18:51:05.836688  5049 socket.hpp:107] Shutdown failed on fd=46: Transport endpoint is not connected [107]

In the end we discovered that the problem was in the leader election: even with QUORUM=1 (we have 2 masters), the election somehow lost track of itself and one master wasn't communicating with the other.

To solve this, we triggered a new election by sending a DELETE request to the Marathon API endpoint /v2/leader, and everything worked fine after that.
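For reference, a minimal sketch of that DELETE call using only the Python standard library. The host name and port in `MARATHON_URL` are placeholders (assumptions, not from our setup), and the helper function names are made up for illustration; only the `/v2/leader` path and the DELETE method come from what we actually did.

```python
import urllib.request

# Hypothetical Marathon address; replace with your own host and port.
MARATHON_URL = "http://marathon.example.com:8080"


def build_leader_delete_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) a DELETE request for Marathon's /v2/leader."""
    url = base_url.rstrip("/") + "/v2/leader"
    return urllib.request.Request(url, method="DELETE")


def force_leader_election(base_url: str, timeout: float = 10.0) -> int:
    """Send the DELETE request and return the HTTP status code."""
    request = build_leader_delete_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Usage (performs a real HTTP call, so it is left commented out):
# print(force_leader_election(MARATHON_URL))
```

Building the request separately from sending it makes it easy to inspect the URL and method before touching a live cluster.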

Upvotes: 1

Tarwin

Reputation: 622

We had the same problem: the mesos-master log was flooded with messages like:

mesos-master[27499]: E0616 14:29:39.310302 27523 socket.hpp:174] Shutdown failed on fd=67: Transport endpoint is not connected [107]

It turned out to be the load balancer's health check hitting /stats.json.
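As background on the message itself: on Linux, errno 107 is ENOTCONN, which the kernel returns when shutdown() is called on a socket that has no connected peer. The sketch below is not Mesos code, just a small illustration that triggers the same errno on a socket that was never connected; the function name is made up for this example.

```python
import errno
import socket


def shutdown_unconnected_socket() -> int:
    """Call shutdown() on a TCP socket that was never connected.

    Returns the errno raised by the OS, or 0 if shutdown() succeeded.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()
    return 0


code = shutdown_unconnected_socket()
# Compare against the symbolic constant; the numeric value differs per platform.
print(code == errno.ENOTCONN, errno.errorcode.get(code))
```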

Upvotes: 0

aladagemre

Reputation: 602

One problem was that the AWS firewall was blocking access to the servers' public IPs, while ZooKeeper was broadcasting the public IP (set in advertise_ip), so the nodes could not connect to each other. Slaves also couldn't connect to the masters, failing with the same error.

When I set advertise_ip to the local IP (so that ZooKeeper broadcast local IPs), the masters could communicate and QUORUM=2 worked. When I removed the firewall rule, the slaves could connect to the master as well.
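A quick way to check this kind of reachability problem is to test, from each node, whether a TCP connection to the advertised address of every other node succeeds. The following is a rough sketch, not something from my actual setup: the IP addresses in the usage comment are placeholders, and the ports assume the usual defaults (5050 for the Mesos master, 2181 for ZooKeeper).

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_peers(peers, timeout: float = 3.0) -> dict:
    """Map each "host:port" string to whether it is reachable."""
    return {
        f"{host}:{port}": can_connect(host, port, timeout)
        for host, port in peers
    }


# Usage with placeholder addresses (replace with the advertised IPs):
# print(check_peers([("10.0.0.11", 5050), ("10.0.0.11", 2181)]))
```

If the advertised address is reachable from the node itself but not from the other masters or slaves, a firewall rule or a wrong advertise_ip is a likely suspect.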

Upvotes: 1
