Reputation: 11
I have a big problem with an Elasticsearch cluster. I have 3 nodes; one node's Elasticsearch process stopped and the cluster went red, so I restarted all nodes with service elasticsearch restart.
Now all nodes are connected and shard reallocation has started, but after about two hours one Elasticsearch process on the master node uses 100% of the CPU and stops responding on ports 9200/9300, so the cluster goes down. This repeats every time the cluster is restarted, regardless of which node is the master.
I do not know what to do, I'm desperate. Can someone help me?
UPDATE: The cluster configuration is:
cluster.name: es-cluster
node.name: es-node1
bootstrap.mlockall: true
discovery.zen.ping.unicast.hosts: ["ec2-52-208-103-xxx.eu-west-1.compute.amazonaws.com", "ec2-52-51-160-xxx.eu-west-1.compute.amazonaws.com", "ec2-52-208-167-xxx.eu-west-1.compute.amazonaws.com"]
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
node.master: true
node.data: true
network.bind_host: 0.0.0.0
network.publish_host: ec2-52-208-103-xxx.eu-west-1.compute.amazonaws.com
The configuration is the same on all nodes, except for network.publish_host and node.name.
Now the cluster is down to 2 nodes and shard reallocation is in progress. Once it finishes, can I use the cluster anyway?
Maybe the configuration is wrong? It has been working properly for months.
Upvotes: 0
Views: 2711
Reputation: 8274
What version of Elasticsearch? That kind of matters in terms of which bugs you might be running into.
What state is your cluster in? Check /_cluster/health
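A minimal check, assuming the node is reachable on localhost:9200 (adjust the host for your EC2 setup):
curl -s 'http://localhost:9200/_cluster/health?pretty'
The status field (green/yellow/red) and the unassigned_shards count show how far recovery has progressed.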
Check the logs for errors on each node. Most likely your nodes are either stuck in garbage collection or out of memory. If so, the log will be full of garbage-collection warnings and possibly some OutOfMemoryError entries as well. That would fully explain them being unresponsive, which can cause all sorts of issues with cluster management. This is why separating master nodes from data nodes is recommended in larger setups.
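For example, something along these lines on each node (the log path is an assumption based on the default packaging and your cluster.name; the exact wording of the GC warnings varies by version):
grep -iE 'gc|outofmemory' /var/log/elasticsearch/es-cluster.log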
Once you have fixed the unresponsive nodes (i.e. stop indexing if you still are; restart if that doesn't help), you can use the /_cat/shards and /_cat/indices APIs to figure out which indices are problematic. The logs will also tell you if there are problems with specific shards.
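For instance, a rough sketch (localhost:9200 assumed again):
curl -s 'http://localhost:9200/_cat/indices?v'
curl -s 'http://localhost:9200/_cat/shards?v' | grep -v STARTED
The second command leaves only shards that are UNASSIGNED, INITIALIZING or RELOCATING.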
Your cluster is red at this point probably due to your earlier full restart (never do this; it is a sure way to take your cluster from yellow to red). So you are probably going to lose some data, and you likely have several unassigned shards as well. If you still have the primary shard, you can try reducing the number of replicas to 0 and then increasing it again (dangerous, be careful); this can sometimes nudge a cluster back to health. Alternatively, if you don't care about the affected indices, delete them.
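A sketch of the replica trick, with my_index as a placeholder for one of your indices:
curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{"index": {"number_of_replicas": 0}}'
and, once the cluster has recovered:
curl -XPUT 'http://localhost:9200/my_index/_settings' -d '{"index": {"number_of_replicas": 1}}'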
In the happier case where your cluster is yellow, you can try adding more nodes and rerouting shards to them. After the cluster goes green, you can take down the problematic nodes one by one (never do this on a yellow cluster).
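Rerouting can be done with the _cluster/reroute API; a sketch, where my_index, shard 0 and the target node name are placeholders:
curl -XPOST 'http://localhost:9200/_cluster/reroute' -d '{"commands": [{"move": {"index": "my_index", "shard": 0, "from_node": "es-node1", "to_node": "es-node2"}}]}'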
If/when you get things up and running, you need to address the reasons you are running out of memory, or this will happen again. It's not an infinite datastore. You are likely running expensive queries, indexing too much data, or doing something else that clearly doesn't scale.
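To see how close each node is running to its heap limit, something like this should work (the column names are assumed from the _cat/nodes defaults on 1.x/2.x):
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'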
I had a similar situation just a few weeks ago and root-caused it to an out-of-control aggregation query combined with huge shards putting a lot of field data on the heap (this was a 1.x cluster). We were also running into known issues with 1.7.4 that were preventing the cluster from rebalancing. I mitigated it as follows: 1) deleted old data I did not need, to reduce shard size; 2) increased the number of shards so each shard is smaller; 3) fixed the query to be less expensive; 4) upgraded to 1.7.5 to prevent the same bug from killing my cluster again.
Upvotes: 1