Reputation: 41
I have a 23 node cluster running CoreOS Stable 681.2.0 on AWS across 4 availability zones. All nodes are running etcd2 and flannel. Of the 23 nodes, 8 are dedicated etcd2 nodes, the rest are specifically designated as etcd2 proxies.
Scheduled to the cluster are 3 nginx plus containers, a private Docker registry, SkyDNS, and 4 of our application containers. The application containers register themselves with with etcd2 and the nginx containers pick up any changes, render the necessary files, and finally reload.
This all works perfectly, until a singe etcd2 node is unavailable for any reason.
If the cluster of voting etcd2 members loses connectivity to a even a single other voting etcd2 member, all of the services scheduled to the fleet become unstable. Scheduled services begin stopping and starting without my intervention.
As a test, I began stopping the EC2 instances which host voting etcd2 nodes until quorum was lost. After the first etcd2 node was stopped, the above symptoms began. After a second node, services became unstable, with no observable change. Then, after the third was stopped quorum was lost and all units were unscheduled. I then started all three etcd2 nodes again and within 60 seconds the cluster had returned to a stable state.
Subsequent tests yield identical results.
Am I hitting a known bug in etcd2, fleet or CoreOS?
Is there a setting I can modify to keep units scheduled onto a node even if etcd is unavailable for any reason?
Upvotes: 4
Views: 513
Reputation: 147
I've experienced the same thing. In my case, when I ran 1 specific unit it caused everything to blow up. Scheduled and perfectly fine running units were suddenly lost without any notice, even machines dropping out of the cluster.
I'm still not sure what the exact problem was, but I think it might have had something to do with etcd vs etcd2. I had a dependency of etcd.service in the unit file, which (I think, not sure) caused CoreOS to try and start etcd.service, while etcd2.service was already running. This might have caused the conflict in my case, and messed up the etcd registry of units and machines.
Something similar might be happening to you, so I suggest you check each host whether you're running etcd or etcd2 and check your unit files to see which one they depend on.
Upvotes: 2