Consul-Agent architecture .. the node-id issue after upgrading to 0.8.1 - conceptual issue?

Question

I am not sure where the root of my problem actually lies, so I will try to explain the bigger picture.

In short, the symptom: after upgrading Consul from 0.7.3 to 0.8.1, my agents (explained below) could no longer connect to the cluster leader due to duplicated node-ids (why that probably happens is explained below). I could neither fix it with https://www.consul.io/docs/agent/options.html#_disable_host_node_id nor fully understand why I run into this, and that is where the bigger picture, and maybe even different questions, come from.

I have the following setup:

I run an application stack with about 8 containers for different services (different microservices, DB types and so on).
I use a single Consul server per stack (yes, the Consul server runs inside the software stack; it has its reasons, because I need this to be offline-deployable and every stack lives for itself).
The Consul server handles registration, service discovery and also KV/configuration.
Important/questionable: every container has a Consul agent started with "consul agent -config-dir /etc/consul.d", connecting to this one server. The configuration looks like this, including two other files with the encrypt token / ACL token. Do not wonder about servicename(): it is replaced by an m4 macro during image build time.
The clients are secured by a gossip key and ACL keys.
Important: all containers are on the same hardware node.
The server configuration looks like this, if that is of any importance. In addition, the ACLs look like this, and an ACL master token and client token/gossip JSON files are in that configuration folder.

Sorry for the probably TL;DR explanation above, but the reason behind it all is this multi-agent setup (or 1 agent per container).

My reasons for that:

I use tiller to configure the containers, so a dimploy gem will usually try to connect to localhost:8500. To accomplish that without making the Consul configuration extraordinarily complicated, I use this local agent, which then forwards the request to the actual server and thus handles all the encryption-key/ACL negotiation stuff.
I use several 'consul watch' tasks on the server to trigger re-configuration; they also run against localhost:8500 without any extra configuration.

That said, the reason I run 1 agent per container is the simplicity for local services to talk to the Consul backend without really knowing about authentication, as long as they connect through 127.0.0.1:8500 (as the level of security).

Final Question:

Is Consul actually designed to be used in such a multi-agent way? The reason I ask is that, as far as I understand, the node-id duplication issue I now get when starting a 0.8.1 agent comes from "the host" being the same, i.e. the hardware node being identical for all Consul agents, right?

Is my design wrong, or do I need to generate my own node-ids from now on and it is all just fine?

Eugen Mayer · Accepted Answer

It seems this issue has been identified by HashiCorp and addressed in https://github.com/hashicorp/consul/blob/master/CHANGELOG.md#085-june-27-2017, where -disable-host-node-id has been set to true by default. The node-id is therefore no longer generated from the host hardware but is a random UUID, which solves the issue I had running several Consul nodes on the same physical hardware.

So the way I deployed was fine.
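For agents older than 0.8.5, the same effect can probably be reached per container through the agent configuration. The following is only a minimal sketch, not something taken from the question's setup: it assumes the `node_id` and `disable_host_node_id` configuration keys (the config-file forms of `-node-id` and `-disable-host-node-id`), and the helper names `node_id_config` and `render` are hypothetical.

```python
import json
import uuid


def node_id_config(explicit_id: bool = True) -> dict:
    """Build a Consul agent config fragment that avoids host-derived node IDs.

    explicit_id=True  -> pin a random UUID via "node_id"
    explicit_id=False -> let the agent pick a random ID itself
    """
    if explicit_id:
        # uuid4() is random, so it does not depend on the shared hardware node.
        return {"node_id": str(uuid.uuid4())}
    # Assumed to be honoured by agents from 0.8.1 on (the flag the question links to).
    return {"disable_host_node_id": True}


def render(config: dict) -> str:
    """Serialize the fragment as JSON, the format read from -config-dir."""
    return json.dumps(config, indent=2)
```

The rendered JSON could be saved once per container, for example on first start, as an extra file in /etc/consul.d (the file name is up to the deployment). Generating it at image build time would give every container started from that image the same ID, which would reintroduce the duplication.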
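To check whether agents on one host still collide, one option is to look at the node list returned by the catalog endpoint (`GET /v1/catalog/nodes` on 127.0.0.1:8500, with an ACL token where ACLs are enabled). The sketch below only covers the comparison step and assumes each entry carries `ID` and `Node` fields; the function name `duplicate_node_ids` is hypothetical.

```python
from collections import defaultdict


def duplicate_node_ids(nodes: list) -> dict:
    """Map each node ID used by more than one node to the node names using it.

    `nodes` is the parsed JSON list from the catalog nodes endpoint.
    Entries without an ID (e.g. from very old agents) are skipped.
    """
    names_by_id = defaultdict(list)
    for node in nodes:
        node_id = node.get("ID")
        if node_id:
            names_by_id[node_id].append(node.get("Node", "<unknown>"))
    return {nid: names for nid, names in names_by_id.items() if len(names) > 1}
```

An empty result means every registered agent has its own node ID.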
