Reputation: 53
After solving "Something inside Elasticsearch 7.4 cluster is getting slower and slower with read timeouts now and then", there is still something off in my cluster. Whenever I run the snapshot command it gives me a 503; when I run it once or twice more, it suddenly starts and creates a snapshot just fine. The opster.com online tool suggests something about snapshots not being configured, however when I run the verify command it suggests, everything looks fine:
$ curl -s -X POST 'http://127.0.0.1:9201/_snapshot/elastic_backup/_verify?pretty'
{
  "nodes" : {
    "JZHgYyCKRyiMESiaGlkITA" : {
      "name" : "elastic7-1"
    },
    "jllZ8mmTRQmsh8Sxm8eDYg" : {
      "name" : "elastic7-4"
    },
    "TJJ_eHLIRk6qKq_qRWmd3w" : {
      "name" : "elastic7-3"
    },
    "cI-cn4V3RP65qvE3ZR8MXQ" : {
      "name" : "elastic7-2"
    }
  }
}
But then:
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "concurrent_snapshot_execution_exception",
        "reason" : "[elastic_backup:snapshot-2020.11.27] a snapshot is already running"
      }
    ],
    "type" : "concurrent_snapshot_execution_exception",
    "reason" : "[elastic_backup:snapshot-2020.11.27] a snapshot is already running"
  },
  "status" : 503
}
Could it be that one of the 4 nodes believes a snapshot is already running, and that this task is randomly assigned to one of the nodes, so that after a few attempts it eventually lands on a node that makes the snapshot? If so, how could I figure out which node claims a snapshot is already running?
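For what it's worth, snapshot-in-progress state is tracked in the cluster state by the elected master rather than on individual data nodes, so two quick checks (same endpoint as above) would be:
$ curl -s 'http://127.0.0.1:9201/_cat/master?v'            # which node is the elected master
$ curl -s 'http://127.0.0.1:9201/_snapshot/_status?pretty' # snapshots the cluster believes are running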
Furthermore, I noticed heap usage is much higher on one of the nodes; what is a normal heap usage?
$ curl -s http://127.0.0.1:9201/_cat/nodes?v
ip heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
10.0.1.215 59 99 7 0.38 0.38 0.36 dilm - elastic7-1
10.0.1.218 32 99 1 0.02 0.17 0.22 dilm * elastic7-4
10.0.1.212 11 99 1 0.04 0.17 0.21 dilm - elastic7-3
10.0.1.209 36 99 3 0.42 0.40 0.36 dilm - elastic7-2
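For more detail, the absolute heap figures can be pulled as well; heap.percent naturally swings with garbage collection, so a point-in-time spread between nodes may not mean much on its own:
$ curl -s 'http://127.0.0.1:9201/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'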
Last night it happened again, and I'm sure nothing was already snapshotting, so I ran the following commands to confirm the weird response; at this point I would not expect to get this error.
$ curl http://127.0.0.1:9201/_snapshot/elastic_backup/_current?pretty
{
  "snapshots" : [ ]
}
$ curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "concurrent_snapshot_execution_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.03] a snapshot is already running"
      }
    ],
    "type" : "concurrent_snapshot_execution_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.03] a snapshot is already running"
  },
  "status" : 503
}
When I run it a 2nd (or sometimes 3rd) time, it will all of a sudden create a snapshot. Note that when I don't run it that 2nd or 3rd time, the snapshot never appears, so I'm 100% sure no snapshot is running at the moment of this error.
There is no SLM configured as far as I know:
{ }
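That empty object is presumably the response of the SLM policy listing (SLM shipped with 7.4); it can be re-checked with:
$ curl -s 'http://127.0.0.1:9201/_slm/policy?pretty'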
The repo is configured properly AFAICT:
$ curl http://127.0.0.1:9201/_snapshot/elastic_backup?pretty
{
  "elastic_backup" : {
    "type" : "fs",
    "settings" : {
      "compress" : "true",
      "location" : "elastic_backup"
    }
  }
}
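For reference, a repository with those settings would have been registered with something like the following (a sketch mirroring the settings shown above):
$ curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup' \
  -H 'Content-Type: application/json' \
  -d '{ "type": "fs", "settings": { "location": "elastic_backup", "compress": true } }'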
Also, in the config the repository location is mapped to the same folder on all nodes, which is an NFS mount of an Amazon EFS volume. It is available and accessible, and after successful snapshots it shows new data.
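The corresponding elasticsearch.yml entry would look roughly like this (the mount path here is illustrative, not the actual one):
# elasticsearch.yml on every node; the shared filesystem must be mounted at the same path everywhere
path.repo: ["/mnt/efs/elastic_backup"]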
As part of the cronjob I have added a query to _cat/tasks?v, so hopefully tonight we will see more. Just now, when I ran the command manually, it ran without problems:
$ curl localhost:9201/_cat/tasks?v ; curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty' ; curl localhost:9201/_cat/tasks?v
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:15885091 - transport 1607068277045 07:51:17 209.6micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:24278976 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277044 07:51:17 62.7micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:15885092 JZHgYyCKRyiMESiaGlkITA:15885091 direct 1607068277045 07:51:17 57.4micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:23773565 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277045 07:51:17 84.7micros 10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:3418325 JZHgYyCKRyiMESiaGlkITA:15885091 transport 1607068277046 07:51:17 56.9micros 10.0.1.209 elastic7-2
{
  "snapshot" : {
    "snapshot" : "snapshot-2020.12.04",
    "uuid" : "u2yQB40sTCa8t9BqXfj_Hg",
    "version_id" : 7040099,
    "version" : "7.4.0",
    "indices" : [
      "log-db-1-2020.06.18-000003",
      "log-db-2-2020.02.19-000002",
      "log-db-1-2019.10.25-000001",
      "log-db-3-2020.11.23-000002",
      "log-db-3-2019.10.25-000001",
      "log-db-2-2019.10.25-000001",
      "log-db-1-2019.10.27-000002"
    ],
    "include_global_state" : true,
    "state" : "SUCCESS",
    "start_time" : "2020-12-04T07:51:17.085Z",
    "start_time_in_millis" : 1607068277085,
    "end_time" : "2020-12-04T07:51:48.537Z",
    "end_time_in_millis" : 1607068308537,
    "duration_in_millis" : 31452,
    "failures" : [ ],
    "shards" : {
      "total" : 28,
      "failed" : 0,
      "successful" : 28
    }
  }
}
action task_id parent_task_id type start_time timestamp running_time ip node
indices:data/read/search JZHgYyCKRyiMESiaGlkITA:15888939 - transport 1607068308987 07:51:48 2.7ms 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:15888942 - transport 1607068308990 07:51:48 223.2micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:24282763 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308989 07:51:48 61.5micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:15888944 JZHgYyCKRyiMESiaGlkITA:15888942 direct 1607068308990 07:51:48 78.2micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:23777841 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308990 07:51:48 63.3micros 10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:3422139 JZHgYyCKRyiMESiaGlkITA:15888942 transport 1607068308991 07:51:48 60micros 10.0.1.209 elastic7-2
Last night (2020-12-12) the cronjob ran the following commands:
curl localhost:9201/_cat/tasks?v
curl localhost:9201/_cat/thread_pool/snapshot?v
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
curl localhost:9201/_cat/tasks?v
sleep 1
curl localhost:9201/_cat/thread_pool/snapshot?v
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
sleep 1
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
sleep 1
curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
And the output was the following:
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:78016838 - transport 1607736001255 01:20:01 314.4micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:82228580 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001254 01:20:01 66micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:55806094 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001255 01:20:01 74micros 10.0.1.218 elastic7-4
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:78016839 JZHgYyCKRyiMESiaGlkITA:78016838 direct 1607736001255 01:20:01 94.3micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:63582174 JZHgYyCKRyiMESiaGlkITA:78016838 transport 1607736001255 01:20:01 73.6micros 10.0.1.209 elastic7-2
node_name name active queue rejected
elastic7-2 snapshot 0 0 0
elastic7-4 snapshot 0 0 0
elastic7-1 snapshot 0 0 0
elastic7-3 snapshot 0 0 0
{
  "error" : {
    "root_cause" : [
      {
        "type" : "concurrent_snapshot_execution_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.12] a snapshot is already running"
      }
    ],
    "type" : "concurrent_snapshot_execution_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.12] a snapshot is already running"
  },
  "status" : 503
}
action task_id parent_task_id type start_time timestamp running_time ip node
cluster:monitor/nodes/stats JZHgYyCKRyiMESiaGlkITA:78016874 - transport 1607736001632 01:20:01 39.6ms 10.0.1.215 elastic7-1
cluster:monitor/nodes/stats[n] TJJ_eHLIRk6qKq_qRWmd3w:82228603 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001631 01:20:01 39.2ms 10.0.1.212 elastic7-3
cluster:monitor/nodes/stats[n] jllZ8mmTRQmsh8Sxm8eDYg:55806114 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001632 01:20:01 39.5ms 10.0.1.218 elastic7-4
cluster:monitor/nodes/stats[n] cI-cn4V3RP65qvE3ZR8MXQ:63582204 JZHgYyCKRyiMESiaGlkITA:78016874 transport 1607736001632 01:20:01 39.4ms 10.0.1.209 elastic7-2
cluster:monitor/nodes/stats[n] JZHgYyCKRyiMESiaGlkITA:78016875 JZHgYyCKRyiMESiaGlkITA:78016874 direct 1607736001632 01:20:01 39.5ms 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists JZHgYyCKRyiMESiaGlkITA:78016880 - transport 1607736001671 01:20:01 348.9micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] JZHgYyCKRyiMESiaGlkITA:78016881 JZHgYyCKRyiMESiaGlkITA:78016880 direct 1607736001671 01:20:01 188.6micros 10.0.1.215 elastic7-1
cluster:monitor/tasks/lists[n] TJJ_eHLIRk6qKq_qRWmd3w:82228608 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001671 01:20:01 106.2micros 10.0.1.212 elastic7-3
cluster:monitor/tasks/lists[n] cI-cn4V3RP65qvE3ZR8MXQ:63582209 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001672 01:20:01 96.3micros 10.0.1.209 elastic7-2
cluster:monitor/tasks/lists[n] jllZ8mmTRQmsh8Sxm8eDYg:55806120 JZHgYyCKRyiMESiaGlkITA:78016880 transport 1607736001672 01:20:01 97.8micros 10.0.1.218 elastic7-4
node_name name active queue rejected
elastic7-2 snapshot 0 0 0
elastic7-4 snapshot 0 0 0
elastic7-1 snapshot 0 0 0
elastic7-3 snapshot 0 0 0
{
  "snapshot" : {
    "snapshot" : "snapshot-2020.12.12",
    "uuid" : "DgwuBxC7SWirjyVlFxBnng",
    "version_id" : 7040099,
    "version" : "7.4.0",
    "indices" : [
      "log-db-sbr-2020.06.18-000003",
      "log-db-other-2020.02.19-000002",
      "log-db-sbr-2019.10.25-000001",
      "log-db-trace-2020.11.23-000002",
      "log-db-trace-2019.10.25-000001",
      "log-db-sbr-2019.10.27-000002",
      "log-db-other-2019.10.25-000001"
    ],
    "include_global_state" : true,
    "state" : "SUCCESS",
    "start_time" : "2020-12-12T01:20:02.544Z",
    "start_time_in_millis" : 1607736002544,
    "end_time" : "2020-12-12T01:20:27.776Z",
    "end_time_in_millis" : 1607736027776,
    "duration_in_millis" : 25232,
    "failures" : [ ],
    "shards" : {
      "total" : 28,
      "failed" : 0,
      "successful" : 28
    }
  }
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "invalid_snapshot_name_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
      }
    ],
    "type" : "invalid_snapshot_name_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
  },
  "status" : 400
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "invalid_snapshot_name_exception",
        "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
      }
    ],
    "type" : "invalid_snapshot_name_exception",
    "reason" : "[elastic_backup:snapshot-2020.12.12] Invalid snapshot name [snapshot-2020.12.12], snapshot with the same name already exists"
  },
  "status" : 400
}
Also the cluster is green at the moment, management queues are not full, everything seems good.
Also there is only one repository:
curl http://127.0.0.1:9201/_cat/repositories?v
id type
elastic_backup fs
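For completeness, the snapshots inside that repository can be listed with:
$ curl -s 'http://127.0.0.1:9201/_cat/snapshots/elastic_backup?v'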
Upvotes: 3
Views: 1664
Reputation: 53
So it turned out that the trouble started with a recent upgrade to Docker 19.03.6 and a move from 1x Docker Swarm manager + 4x Docker Swarm workers to 5x Docker Swarm managers + 4x Docker Swarm workers. In both setups Elastic ran on the workers. This upgrade/change altered the number of network interfaces inside the containers, so we had to set 'publish_host' in Elastic to make things work again.
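That interim workaround looked roughly like this in the service environment (the interface value is illustrative):
# Docker service environment (sketch); pins the address ES advertises to other nodes
environment:
  - network.publish_host=_eth0_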
To fix the problem we had to stop publishing the Elastic ports over the ingress network, so that the additional network interfaces went away. Next we could drop the 'publish_host' setting. This made things work a bit better. But to really solve our issues we had to change the Docker Swarm deploy endpoint_mode to dnsrr so that traffic would not go through the Docker Swarm routing mesh.
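In stack-file terms the final shape was roughly the following (a sketch; service name and image tag are illustrative):
# docker-compose.yml (Swarm stack), relevant fragment only
services:
  elastic:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.4.0
    deploy:
      endpoint_mode: dnsrr   # DNS round-robin: no ingress VIP, no routing mesh
    # note: no ports published over the ingress network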
We had always had occasional 'Connection reset by peer' issues, but since the change they became worse and made Elasticsearch behave strangely. I guess running Elasticsearch inside Docker Swarm (or Kubernetes or the like) can be a tricky thing to debug.
Using tcpdump in the containers and conntrack -S on the hosts, we were able to see perfectly fine connections being reset for no reason. Another option was to have the kernel drop mismatching packets (instead of sending resets), but avoiding DNAT/SNAT in this setup as much as possible seemed to solve things too.
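The host-side knobs involved were roughly these (commonly used for this class of conntrack problem; a sketch, not a drop-in fix):
# per-CPU conntrack counters; a growing "invalid" column hints at this problem
$ conntrack -S
# tolerate out-of-window TCP packets instead of marking them INVALID
$ sysctl -w net.netfilter.nf_conntrack_tcp_be_liberal=1
# or drop INVALID packets so they cannot trigger resets
$ iptables -A INPUT -m conntrack --ctstate INVALID -j DROP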
Upvotes: 1
Reputation: 281
Elasticsearch version 7.4 only supports one snapshot operation at a time.
From the error it seems a previously triggered snapshot was still running when you triggered a new snapshot, and Elasticsearch throws a concurrent_snapshot_execution_exception.
You can check the list of currently running snapshots by using
GET /_snapshot/elastic_backup/_current
I suggest you first check whether any snapshot operation is running in your Elasticsearch cluster using the above API, and trigger a new snapshot only if none is currently running.
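A minimal cron guard along those lines might look like this (a sketch reusing the endpoint and repository name from the question; running snapshots report an "IN_PROGRESS" state):
#!/bin/sh
# create the daily snapshot only when none is currently in flight
if curl -s 'http://127.0.0.1:9201/_snapshot/elastic_backup/_current' | grep -q 'IN_PROGRESS'; then
  echo "a snapshot is already running; skipping"
else
  curl -s -X PUT 'http://127.0.0.1:9201/_snapshot/elastic_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E?wait_for_completion=true&pretty'
fi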
P.S.: From Elasticsearch version 7.7 onwards, concurrent snapshots are supported as well. So if you plan to perform concurrent snapshot operations in your cluster, you should upgrade to ES version 7.7 or above.
Upvotes: 0