ElasticSearch Issues Every 2 hours and 15 minutes

Question

Roughly every 2 hours and 15 minutes, we lose one node that is in another datacenter. I cannot figure out what the issue may be. Has anyone seen this before or had any experience with it?

[2016-08-11 07:42:14,886][INFO ][cluster.routing.allocation] [node-exp-01] Cluster health status changed from [GREEN] to [YELLOW] (reason: [[{node-cl-01}{xxxxxxxxxxxxxxxxxxx}{192.168.41.100}{192.168.41.100:9300}{rack=century-link}] failed]).
[2016-08-11 07:42:14,886][INFO ][cluster.service          ] [node-exp-01] removed {{node-cl-01}{xxxxxxxxxxxxxxxxxxx}{192.168.41.100}{192.168.41.100:9300}{rack=century-link},}, reason: zen-disco-node_failed({node-cl-01}{xxxxxxxxxxxxxxxxxxx}{192.168.41.100}{192.168.41.100:9300}{rack=century-link}), reason transport disconnected
[2016-08-11 07:42:14,891][INFO ][cluster.routing          ] [node-exp-01] delaying allocation for [6] unassigned shards, next check in [1m]
[2016-08-11 07:42:19,402][INFO ][cluster.service          ] [node-exp-01] added {{node-cl-01}{xxxxxxxxxxxxxxxxxxx}{192.168.41.100}{192.168.41.100:9300}{rack=century-link},}, reason: zen-disco-join(join from node[{node-cl-01}{xxxxxxxxxxxxxxxxxxx}{192.168.41.100}{192.168.41.100:9300}{rack=century-link}])
[2016-08-11 07:42:20,728][INFO ][cluster.routing.allocation] [node-exp-01] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[recordings][3]] ...]).

Greatly appreciated,

Thanks!

Accepted Answer

Yellow means that ES has allocated all of the primary shards, but some or all of the replicas have not been allocated. So it is not that dramatic.
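To make the colour semantics concrete, here is a small illustrative sketch. The function `cluster_status` is hypothetical and not part of any Elasticsearch client; it only encodes the documented rule: red if any primary is unassigned, yellow if all primaries are assigned but some replicas are not, green otherwise. Since the log above went yellow, the 6 delayed shards are presumably replica copies that lived on node-cl-01.

```python
def cluster_status(unassigned_primaries: int, unassigned_replicas: int) -> str:
    """Hypothetical helper: map shard allocation to an ES health colour."""
    if unassigned_primaries > 0:
        return "red"      # at least one primary shard is not allocated
    if unassigned_replicas > 0:
        return "yellow"   # all primaries allocated, some replicas are not
    return "green"        # every primary and replica is allocated


# Situation in the log: no primaries missing, 6 replica copies unassigned.
print(cluster_status(0, 6))  # yellow
```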

Finding the reason is not so easy. My guess is that there was no traffic between the locations for some time. ES keeps long-lived TCP connections between the nodes and needs TCP keepalive messages on them to keep them open, because an idle connection between datacenters can be dropped by devices along the path, such as firewalls. Check the TCP keepalive time of your underlying OS, which should be fairly low, e.g. 600 seconds; this is the idle time after which the first TCP keepalive message is sent. Also consider a lower interval between the keepalive messages.
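A minimal sketch for checking this on Linux, assuming the nodes run Linux and expose the standard `net.ipv4.tcp_keepalive_*` sysctls under `/proc/sys`. The helper names are hypothetical. The Linux defaults are 7200 s idle time, 75 s between probes and 9 probes, so a silent peer is only declared dead after about 131 minutes, which may or may not be related to the interval you are seeing. Lowering the idle time would be done with something like `sysctl -w net.ipv4.tcp_keepalive_time=600` (and persisted in `/etc/sysctl.conf`).

```python
from pathlib import Path

SYSCTL_DIR = Path("/proc/sys/net/ipv4")  # Linux only


def read_keepalive_settings(base: Path = SYSCTL_DIR) -> dict:
    """Read the kernel TCP keepalive sysctls; returns {} if they are unavailable."""
    settings = {}
    for name in ("tcp_keepalive_time", "tcp_keepalive_intvl", "tcp_keepalive_probes"):
        path = base / name
        if path.exists():
            settings[name] = int(path.read_text().strip())
    return settings


def dead_peer_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Worst-case seconds before the kernel drops a silent connection:
    `idle` seconds without traffic, then `probes` unanswered probes `interval` apart."""
    return idle + interval * probes


if __name__ == "__main__":
    print(read_keepalive_settings())  # e.g. the values on this host, or {}
    print(dead_peer_detection_seconds(7200, 75, 9) / 60)  # Linux defaults: 131.25 minutes
    print(dead_peer_detection_seconds(600, 75, 9) / 60)   # with a 600 s idle time: 21.25 minutes
```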

Use the cluster health API and post the output.

https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html
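A minimal sketch of calling that API from Python with only the standard library, assuming the cluster answers HTTP on `http://localhost:9200` (adjust `ES_URL`, a hypothetical constant, for your setup). `fetch_cluster_health` and `summarize` are illustrative helpers; `status`, `number_of_nodes` and `unassigned_shards` are fields of the `_cluster/health` response.

```python
import json
from urllib.request import urlopen

ES_URL = "http://localhost:9200"  # assumption: change host/port for your cluster


def fetch_cluster_health(base_url: str = ES_URL) -> dict:
    """GET /_cluster/health and decode the JSON body."""
    with urlopen(f"{base_url}/_cluster/health", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize(health: dict) -> str:
    """One-line summary of the fields most relevant to a node dropping out."""
    return (f"status={health['status']} nodes={health['number_of_nodes']} "
            f"unassigned={health['unassigned_shards']}")
```

Running `print(summarize(fetch_cluster_health()))` right after a disconnect should show the node count drop and the unassigned shards appear.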
