Xavi

Reputation: 20459

Overriding `tcp.publish_port` breaks clustering when elasticsearch is in a container

I'm trying to run an elasticsearch cluster with each es-node running in its own container. These containers are deployed using ECS across several machines that may be running other, unrelated containers. To avoid port conflicts, each port a container exposes is mapped to a random host port. These random ports are consistent across all running containers of the same type; in other words, every running es-node container maps port 9300 to the same random host port.

Here's the config I'm using:

network:
  host: 0.0.0.0                 # bind to all interfaces inside the container

plugin:
  mandatory: cloud-aws          # refuse to start if the cloud-aws plugin is missing

cluster:
  name: ${ES_CLUSTER_NAME}

discovery:
  type: ec2
  ec2:
    groups: ${ES_SECURITY_GROUP}
    any_group: false
  zen.ping.multicast.enabled: false

transport:
  tcp.port: 9300                              # port bound inside the container
  publish_port: ${_INSTANCE_PORT_TRANSPORT}   # host port advertised to other nodes

cloud.aws:
  access_key: ${AWS_ACCESS_KEY}
  secret_key: ${AWS_SECRET_KEY}
  region: ${AWS_REGION}

In this case _INSTANCE_PORT_TRANSPORT is the host port that container port 9300 is bound to. I've confirmed that all the environment variables used above are set correctly. I'm also setting network.publish_host to the host machine's local IP via a command-line argument, roughly as in the sketch below.
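
For illustration only, a node launch looks something like this (a hedged sketch: the port 32768, the IP 10.0.1.5, and the image tag are hypothetical stand-ins, and the -Des.* flag style assumes an Elasticsearch 1.x/2.x image):

# Hypothetical launch of one es-node container. 32768 stands in for the
# random host port assigned to this service, 10.0.1.5 for the host's local IP.
# (AWS credential and region env vars omitted for brevity.)
docker run -d \
  -p 32768:9300 \
  -e ES_CLUSTER_NAME=my-cluster \
  -e _INSTANCE_PORT_TRANSPORT=32768 \
  elasticsearch:2 \
  -Des.network.publish_host=10.0.1.5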

When I forced _INSTANCE_PORT_TRANSPORT (and in turn transport.publish_port) to be 9300, everything worked great, but as soon as it's given a random value, nodes can no longer connect to each other. With logger.discovery set to TRACE, I see errors like this:

ConnectTransportException[[][10.0.xxx.xxx:9300] connect_timeout[30s]]; nested: ConnectException[Connection refused: /10.0.xxx.xxx:9300];
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannelsLight(NettyTransport.java:952)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:916)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNodeLight(NettyTransport.java:888)
    at org.elasticsearch.transport.TransportService.connectToNodeLight(TransportService.java:267)
    at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:395)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

It seems that the port a node binds to (9300) is also the port it pings when trying to connect to other nodes. Is there any way to make them different? If not, what is transport.publish_port for?

Upvotes: 0

Views: 249

Answers (1)

dadoonet

Reputation: 14537

The way the discovery-ec2 plugin works is that it collects a list of IP addresses using the AWS EC2 API and uses this list as the unicast list of nodes.

But it does not collect any information from the running cluster (obviously, the node is not connected yet!), so it does not know anything about the publish_port of the other nodes.

It just adds an IP address, and that's all; Elasticsearch then uses the default transport port, which is 9300, as the sketch below illustrates.
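
To make that concrete, here's a sketch with hypothetical IPs and a hypothetical port; the static discovery.zen.ping.unicast.hosts setting is shown only as the equivalent of what the plugin produces:

# What discovery-ec2 effectively builds from the EC2 API: bare IPs,
# so Elasticsearch assumes the default transport port 9300.
discovery.zen.ping.unicast.hosts: ["10.0.1.5", "10.0.1.6"]

# What this setup would actually need, but the plugin has no way to
# discover the randomly mapped host port of the other nodes:
# discovery.zen.ping.unicast.hosts: ["10.0.1.5:32768", "10.0.1.6:32768"]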

So, in my opinion, there is nothing you can do to fix that in the short term.

But we can imagine adding a new feature close to what has been implemented for Google Compute Engine, where we use a specific metadata entry to get this port from the GCE APIs.

We could do the same for Azure and EC2. Do you want to open an issue so we can track the effort?

Upvotes: 2
