Reputation: 53
After solving "Something inside Elasticsearch 7.4 cluster is getting slower and slower with read timeouts now and then", there is still something off in my cluster. Whenever I run the snapshot command it gives me a 503; when I run it once or twice more, it suddenly starts and creates a snapshot just fine. The opster.com online tool suggests something about snapshots not being configured, however when I run the verify command it suggests, everything looks fine:
$ curl -s -X POST 'http://127.0.0.1:9201/_snapshot/elastic_backup/_verify?pretty'
{
  "nodes" : {
    "JZHgYyCKRyiMESiaGlkITA" : {
      "name" : "elastic7-1"
    },
    "jllZ8mmTRQmsh8Sxm8eDYg" : {
      "name" : "elastic7-4"
    },
    "TJJ_eHLIRk6qKq_qRWmd3w" : {
      "name" : "elastic7-3"
    },
    "cI-cn4V3RP65qvE3ZR8MXQ" : {
      "name" : "elastic7-2"
    }
  }
}
But then:
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "concurrent_snapshot_execution_exception",
        "reason" : "[elastic_backup:snapshot-2020.11.27] a snapshot is already running"
      }
    ],
    "type" : "concurrent_snapshot_execution_exception",
    "reason" : "[elastic_backup:snapshot-2020.11.27] a snapshot is already running"
  },
  "status" : 503
}
Could it be that one of the 4 nodes believes a snapshot is already running, and that this task is randomly assigned to one of the nodes, so that after a few attempts it eventually lands on a node that makes the snapshot? If so, how could I figure out which node claims a snapshot is already running?
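For what it's worth, snapshot-in-progress state is tracked in the cluster state by the elected master rather than on individual data nodes, so two quick checks (same endpoint as above) would be:
$ curl -s 'http://127.0.0.1:9201/_cat/master?v'            # which node is the elected master
$ curl -s 'http://127.0.0.1:9201/_snapshot/_status?pretty' # snapshots the cluster believes are running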
Furthermore, I noticed heap usage is much higher on one of the nodes; what is a normal heap usage?
$ curl -s http://127.0.0.1:9201/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.1.215 59 99 7 0.38 0.38 0.36 dilm - elastic7-1
10.0.1.218 32 99 1 0.02 0.17 0.22 dilm * elastic7-4
10.0.1.212 11 99 1 0.04 0.17 0.21 dilm - elastic7-3
10.0.1.209 36 99 3 0.42 0.40 0.36 dilm - elastic7-2
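For more detail, the absolute heap figures can be pulled as well; heap.percent naturally swings with garbage collection, so a point-in-time spread between nodes may not mean much on its own:
$ curl -s 'http://127.0.0.1:9201/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'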
Last night it happened again, and I'm sure nothing was already snapshotting, so I ran the following commands to confirm the weird response; at this point I would not expect to get this error.
$ curl http://127.0.0.1:9201/_snapshot/elastic_backup/_current?pretty
{
  "snapshots" : [ ]
}
$ curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "concurrent_snapshot_execution_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.03] a snapshot is already running"
      }
    ],
    "type" : "concurrent_snapshot_execution_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.03] a snapshot is already running"
  },
  "status" : 503
}
When I run it a 2nd (or sometimes 3rd) time, it will all of a sudden create a snapshot. Note that when I don't run it that 2nd or 3rd time, the snapshot never appears, so I'm 100% sure no snapshot is running at the moment of this error.
There is no SLM configured as far as I know:
{ }
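That empty object is presumably the response of the SLM policy listing (SLM shipped with 7.4); it can be re-checked with:
$ curl -s 'http://127.0.0.1:9201/_slm/policy?pretty'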
The repo is configured properly AFAICT:
$ curl http://127.0.0.1:9201/_snapshot/elastic_backup?pretty
{
  "elastic_backup" : {
    "type" : "fs",
    "settings" : {
      "compress" : "true",
      "location" : "elastic_backup"
    }
  }
}
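For reference, a repository with those settings would have been registered with something like the following (a sketch mirroring the settings shown above):
$ curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup' \
  -H 'Content-Type: application/json' \
  -d '{ "type": "fs", "settings": { "location": "elastic_backup", "compress": true } }'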
Also, in the config the repository location is mapped to the same folder on all nodes, which is an NFS mount of an Amazon EFS volume. It is available and accessible, and after successful snapshots it shows new data.
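The corresponding elasticsearch.yml entry would look roughly like this (the mount path here is illustrative, not the actual one):
# elasticsearch.yml on every node; the shared filesystem must be mounted at the same path everywhere
path.repo: ["/mnt/efs/elastic_backup"]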
As part of the cronjob I have added a query to _cat/tasks?v, so hopefully tonight we will see more. Just now, when I ran the command manually, it ran without problems:
$ curl localhost:9201/_cat/tasks?v ; curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty' ; curl localhost:9201/_cat/tasks?v
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:15885091 - transport 1607068277045 07:51:17 209.6micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:24278976 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277044 07:51:17 62.7micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:15885092 JZHgYyCKRyiMESiaGlkITA:15885091 direct 1607068277045 07:51:17 57.4micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:23773565 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277045 07:51:17 84.7micros 10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:3418325 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277046 07:51:17 56.9micros 10.0.1.209 elastic7-2
{
  "snapshot" : {
    "snapshot" : "snapshot-2020.12.04",
    "uuid" : "u2yQB40sTCa8t9BqXfj_Hg",
    "version_id" : 7040099,
    "version" : "7.4.0",
    "indices" : [
      "log-db-1-2020.06.18-000003",
      "log-db-2-2020.02.19-000002",
      "log-db-1-2019.10.25-000001",
      "log-db-3-2020.11.23-000002",
      "log-db-3-2019.10.25-000001",
      "log-db-2-2019.10.25-000001",
      "log-db-1-2019.10.27-000002"
    ],
    "include_global_state" : true,
    "state" : "SUCCESS",
    "start_time" : "2020-12-04T07:51:17.085Z",
    "start_time_in_millis" : 1607068277085,
    "end_time" : "2020-12-04T07:51:48.537Z",
    "end_time_in_millis" : 1607068308537,
    "duration_in_millis" : 31452,
    "failures" : [ ],
    "shards" : {
      "total" : 28,
      "failed" : 0,
      "successful" : 28
    }
  }
}
action task_id parent_task_id type start_time timestamp running_time ip node
indices:data/read/search JZHgYyCKRyiMESiaGlkITA:15888939 - transport 1607068308987 07:51:48 2.7ms 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:15888942 - transport 1607068308990 07:51:48 223.2micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:24282763 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308989 07:51:48 61.5micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:15888944 JZHgYyCKRyiMESiaGlkITA:15888942 direct 1607068308990 07:51:48 78.2micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:23777841 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308990 07:51:48 63.3micros 10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:3422139 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308991 07:51:48 60micros 10.0.1.209 elastic7-2
Last night (2020-12-12) the cronjob ran the following commands:
curl localhost:9201/_cat/tasks?v
curl localhost:9201/_cat/thread_pool/snapshot?v
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
curl localhost:9201/_cat/tasks?v
sleep 1
curl localhost:9201/_cat/thread_pool/snapshot?v
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
sleep 1
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
sleep 1
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
And the output was the following:
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:78016838 - transport 1607736001255 01:20:01 314.4micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:82228580 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001254 01:20:01 66micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:55806094 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001255 01:20:01 74micros 10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:78016839 JZHgYyCKRyiMESiaGlkITA:78016838 direct 1607736001255 01:20:01 94.3micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:63582174 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001255 01:20:01 73.6micros 10.0.1.209 elastic7-2
node_name name active queue rejected
elastic7-2 snapshot 0 0 0
elastic7-4 snapshot 0 0 0
elastic7-1 snapshot 0 0 0
elastic7-3 snapshot 0 0 0
{
  "error" : {
    "root_cause" : [
      {
        "type" : "concurrent_snapshot_execution_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.12] a snapshot is already running"
      }
    ],
    "type" : "concurrent_snapshot_execution_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.12] a snapshot is already running"
  },
  "status" : 503
}
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:monitor/nodes/stats JZHgYyCKRyiMESiaGlkITA:78016874 - transport 1607736001632 01:20:01 39.6ms 10.0.1.215 elastic7-1
cluster:monitor/nodes/stats[n] TJJ_eHLIRk6qKq_qRWmd3w:82228603 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001631 01:20:01 39.2ms 10.0.1.212 elastic7-3
cluster:monitor/nodes/stats[n] jllZ8mmTRQmsh8Sxm8eDYg:55806114 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001632 01:20:01 39.5ms 10.0.1.218 elastic7-4
cluster:monitor/nodes/stats[n] cI-cn4V3RP65qvE3ZR8MXQ:63582204 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001632 01:20:01 39.4ms 10.0.1.209 elastic7-2
cluster:monitor/nodes/stats[n] JZHgYyCKRyiMESiaGlkITA:78016875 JZHgYyCKRyiMESiaGlkITA:78016874 direct 1607736001632 01:20:01 39.5ms 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:78016880 - transport 1607736001671 01:20:01 348.9micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:78016881 JZHgYyCKRyiMESiaGlkITA:78016880 direct 1607736001671 01:20:01 188.6micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:82228608 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001671 01:20:01 106.2micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:63582209 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001672 01:20:01 96.3micros 10.0.1.209 elastic7-2
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:55806120 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001672 01:20:01 97.8micros 10.0.1.218 elastic7-4
node_name name active queue rejected
elastic7-2 snapshot 0 0 0
elastic7-4 snapshot 0 0 0
elastic7-1 snapshot 0 0 0
elastic7-3 snapshot 0 0 0
{
  "snapshot" : {
    "snapshot" : "snapshot-2020.12.12",
    "uuid" : "DgwuBxC7SWirjyVlFxBnng",
    "version_id" : 7040099,
    "version" : "7.4.0",
    "indices" : [
      "log-db-sbr-2020.06.18-000003",
      "log-db-other-2020.02.19-000002",
      "log-db-sbr-2019.10.25-000001",
      "log-db-trace-2020.11.23-000002",
      "log-db-trace-2019.10.25-000001",
      "log-db-sbr-2019.10.27-000002",
      "log-db-other-2019.10.25-000001"
    ],
    "include_global_state" : true,
    "state" : "SUCCESS",
    "start_time" : "2020-12-12T01:20:02.544Z",
    "start_time_in_millis" : 1607736002544,
    "end_time" : "2020-12-12T01:20:27.776Z",
    "end_time_in_millis" : 1607736027776,
    "duration_in_millis" : 25232,
    "failures" : [ ],
    "shards" : {
      "total" : 28,
      "failed" : 0,
      "successful" : 28
    }
  }
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "invalid_snapshot_name_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
      }
    ],
    "type" : "invalid_snapshot_name_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
  },
  "status" : 400
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "invalid_snapshot_name_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
      }
    ],
    "type" : "invalid_snapshot_name_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
  },
  "status" : 400
}
Also the cluster is green at the moment, management queues are not full, everything seems good.
Also there is only one repository:
curl http://127.0.0.1:9201/_cat/repositories?v
id type
elastic_backup fs
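For completeness, the snapshots inside that repository can be listed with:
$ curl -s 'http://127.0.0.1:9201/_cat/snapshots/elastic_backup?v'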
Upvotes: 3
Views: 1664
Reputation: 53
So it turned out that the trouble started with a recent upgrade to Docker 19.03.6 and a move from 1x Docker Swarm manager + 4x Docker Swarm workers to 5x Docker Swarm managers + 4x Docker Swarm workers. In both setups Elastic ran on the workers. This upgrade/change altered the number of network interfaces inside the containers, so we had to set 'publish_host' in Elastic to make things work again.
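That interim workaround looked roughly like this in the service environment (the interface value is illustrative):
# Docker service environment (sketch); pins the address ES advertises to other nodes
environment:
  - network.publish_host=_eth0_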
To fix the problem we had to stop publishing the Elastic ports over the ingress network, so that the additional network interfaces went away. Next we could drop the 'publish_host' setting. This made things work a bit better. But to really solve our issues we had to change the Docker Swarm deploy endpoint_mode to dnsrr so that traffic would not go through the Docker Swarm routing mesh.
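In stack-file terms the final shape was roughly the following (a sketch; service name and image tag are illustrative):
# docker-compose.yml (Swarm stack), relevant fragment only
services:
  elastic:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.4.0
    deploy:
      endpoint_mode: dnsrr   # DNS round-robin: no ingress VIP, no routing mesh
    # note: no ports published over the ingress network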
We had always had occasional 'Connection reset by peer' issues, but since the change they became worse and made Elasticsearch behave strangely. I guess running Elasticsearch inside Docker Swarm (or Kubernetes or the like) can be a tricky thing to debug.
Using tcpdump in the containers and conntrack -S on the hosts, we were able to see perfectly fine connections being reset for no reason. Another option was to have the kernel drop mismatching packets (instead of sending resets), but avoiding DNAT/SNAT in this setup as much as possible seemed to solve things too.
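The host-side knobs involved were roughly these (commonly used for this class of conntrack problem; a sketch, not a drop-in fix):
# per-CPU conntrack counters; a growing "invalid" column hints at this problem
$ conntrack -S
# tolerate out-of-window TCP packets instead of marking them INVALID
$ sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
# or drop INVALID packets so they cannot trigger resets
$ iptables -A INPUT -m conntrack --ctstate INVALID -j DROP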
Upvotes: 1
Reputation: 281
Elasticsearch version 7.4 only supports one snapshot operation at a time.
From the error it seems a previously triggered snapshot was still running when you triggered a new snapshot, and Elasticsearch throws a concurrent_snapshot_execution_exception.
You can check the list of currently running snapshots by using
GET /_snapshot/elastic_backup/_current
I suggest you first check whether any snapshot operation is running in your Elasticsearch cluster using the above API, and trigger a new snapshot only if none is currently running.
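A minimal cron guard along those lines might look like this (a sketch reusing the endpoint and repository name from the question; running snapshots report an "IN_PROGRESS" state):
#!/bin/sh
# create the daily snapshot only when none is currently in flight
if curl -s 'http://127.0.0.1:9201/_snapshot/elastic_backup/_current' | grep -q 'IN_PROGRESS'; then
  echo "a snapshot is already running; skipping"
else
  curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
fi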
P.S.: From Elasticsearch version 7.7 onwards, concurrent snapshots are supported as well. So if you plan to perform concurrent snapshot operations in your cluster, you should upgrade to ES version 7.7 or above.
Upvotes: 0