A-y

Reputation: 793

Nodes won't join cluster : NotMasterException (Weird master election bug)

I'm setting up an Elasticsearch (5.0.1) cluster.

It has three master-eligible nodes:

el-m01
el-m02
el-m03

The cluster fails to assemble, and every master node gets the following NotMasterException in its logs:

[2016-11-21T15:24:13,274][INFO ][o.e.d.z.ZenDiscovery     ] [el-m01] failed to send join request to master [{el-m02}{bBhsu3fJSj-MyiWJGhQmog}{_IzdeUd4Sv6g-rhemGjEVQ}{192.168.110.118}{192.168.110.118:9300}{rack=r1}], reason [RemoteTransportException[[el-m02][192.168.110.118:9300][internal:discovery/zen/join]]; nested: NotMasterException[Node [{el-m02}{bBhsu3fJSj-MyiWJGhQmog}{_IzdeUd4Sv6g-rhemGjEVQ}{192.168.110.118}{192.168.110.118:9300}{rack=r1}] not master for join request]; ], tried [3] times

Enabling debug logging allowed me to understand the following:

The master election happens, and it succeeds. However, while every node has chosen a master, no node thinks it is the master itself.

What is happening here?

Upvotes: 5

Views: 4300

Answers (2)

PhaedrusTheGreek

Reputation: 584

The Elasticsearch data directory ($ES_HOME/data, or for package installs typically /var/lib/elasticsearch) contains a randomly generated node ID, created the first time Elasticsearch starts. If this directory is copied to multiple instances that are expected to form a cluster, the following error will be received:

failed to send join request to master [..] IllegalArgumentException [..] found existing node [..] with the same id but is a different node instance

However, when minimum_master_nodes is not met, an error less indicative of the problem is received:

failed to send join request to master [..] NotMasterException [..] not master for join request

GitHub: https://github.com/elastic/elasticsearch/issues/32904

The issue can be resolved by deleting the contents of the data directory; in any case, data directories shouldn't be copied between nodes in the first place.

Upvotes: 1

A-y

Reputation: 793

Here is the situation: because all the master nodes were created by cloning a single VM, every node ended up with the same node ID.

This can be verified with the following command, which lists all node IDs:

GET /_cat/nodes?v&h=id,ip,name&full_id=true

Note that since your cluster hasn't formed, each node needs to be queried individually, e.g.:

curl '192.168.110.111:9200/_cat/nodes?v&h=id,ip,name&full_id=true'
curl '192.168.110.112:9200/_cat/nodes?v&h=id,ip,name&full_id=true'
(...)
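One way to spot the duplication at a glance (a sketch; the IP list is hypothetical, substitute your own nodes) is to collect just the `id` column from each node and print any ID that occurs more than once:

```shell
# Query each node individually, keeping only the node ID column,
# then print any ID reported by more than one node.
# The IPs below are placeholders - replace them with your own.
for ip in 192.168.110.111 192.168.110.112 192.168.110.113; do
  curl -s "http://$ip:9200/_cat/nodes?h=id&full_id=true"
done | sort | uniq -d
```

Any output at all means at least two nodes share a node ID; a healthy set of nodes prints nothing.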

This is bad: node IDs need to be unique.

To solve this situation, you need to delete the indices (in /var/lib/elasticsearch) on every node. This deletes all data held in Elasticsearch, and it also resets the node IDs.
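As a minimal recovery sketch, assuming a package-based install managed by systemd with the default data path (adjust the service name and path to your layout):

```shell
# Run on every affected node. WARNING: this deletes all Elasticsearch data.
sudo systemctl stop elasticsearch
sudo rm -rf /var/lib/elasticsearch/nodes   # node ID state lives under here
sudo systemctl start elasticsearch         # a fresh node ID is generated on start
```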

To avoid having this problem in the first place, you can:

  • A. Install Elasticsearch after having cloned the VMs.
  • B. Use an automated tool like Ansible or Puppet to manage Elasticsearch.

Upvotes: 19
