tuk

Reputation: 6892

Elasticsearch 7.7.1 shards getting unassigned

We recently upgraded our elasticsearch cluster from 5.6.16 to 7.7.1.

After that, I am sometimes observing that a few of the shards are not getting assigned.

My node stats are placed here.

The allocation explanation for an unassigned shard is shown below:

ubuntu@platform2:~$      curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
> {
>   "index": "denorm",
>   "shard": 14,
>   "primary": false
> }
> '
{
  "index" : "denorm",
  "shard" : 14,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-11-19T13:09:42.072Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "0_00hk5IRcmgrHGYjpV1jA",
      "node_name" : "platform2",
      "transport_address" : "10.62.70.178:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:09:42.072Z], failed_attempts[5], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "9ltF-KXGRk-xMF_Ef1DAng",
      "node_name" : "platform3",
      "transport_address" : "10.62.70.179:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:09:42.072Z], failed_attempts[5], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[denorm][14], node[9ltF-KXGRk-xMF_Ef1DAng], [P], s[STARTED], a[id=SNyCoFUzSwaiIE4187Tfig]]"
        }
      ]
    },
    {
      "node_id" : "ocKks7zJT7OODhse-yveyg",
      "node_name" : "platform1",
      "transport_address" : "10.62.70.177:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:09:42.072Z], failed_attempts[5], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ], allocation_status[no_attempt]]]"
        }
      ]
    }
  ]
}

As mentioned in the node stats here, I am observing that out of the 4.3 GB heap only ~85 MB is used for keeping in-memory data structures.
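For reference, I am checking the heap and breaker usage with something like the commands below (the accounting breaker is the in-memory data structures figure I am referring to, and the parent breaker's "tripped" counter shows how often it has rejected requests):

# JVM heap usage per node
curl -s 'localhost:9200/_nodes/stats/jvm?pretty'

# circuit breaker usage, limits and tripped counts per node
curl -s 'localhost:9200/_nodes/stats/breaker?pretty'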

As discussed here, on setting indices.breaker.total.use_real_memory: false I am no longer seeing the Data too large exception.
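For completeness, this is a static node setting, so it goes into elasticsearch.yml and needs a node restart to take effect:

# elasticsearch.yml
indices.breaker.total.use_real_memory: false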

Can someone let me know how I can confirm whether I am observing the same issue as discussed here?

I did not see this issue with elasticsearch 5.6.16.

Upvotes: 2

Views: 2150

Answers (1)

Amit

Reputation: 32386

As pointed out by @Val in the comment, due to your circuit breaker configuration in ES 7.x, ES couldn't allocate the shard on the other data nodes and is now left with only the node that already holds the primary shard:

{
      "node_id" : "9ltF-KXGRk-xMF_Ef1DAng",
      "node_name" : "platform3",
      "transport_address" : "10.62.70.179:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:09:42.072Z], failed_attempts[5], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[denorm][14], node[9ltF-KXGRk-xMF_Ef1DAng], [P], s[STARTED], a[id=SNyCoFUzSwaiIE4187Tfig]]"
        }
      ]
    },

Note the error message: **CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb]**.

Since Elasticsearch never assigns a primary and its replica shard to the same node for high-availability reasons, try to fix the circuit breaker exceptions by tuning their settings.
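For example, the parent breaker limit can be adjusted dynamically via the cluster settings API; the 80% value below is only an illustration, and increasing the JVM heap in jvm.options is usually the better long-term fix:

curl -X PUT 'localhost:9200/_cluster/settings?pretty' -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "indices.breaker.total.limit": "80%"
  }
}
'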

The default number of allocation retries is just 5, and after fixing the issue you can retry the allocation using the command below.

curl -X POST 'localhost:9200/_cluster/reroute?retry_failed=true'

After running the above command, if you still have a few failures, you might have to reroute the shards manually, for which you can use the reroute API.
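As a sketch, a manual reroute for the shard in your question would look like the following (the target node is only an example; pick a node that does not already hold a copy of [denorm][14]):

curl -X POST 'localhost:9200/_cluster/reroute?pretty' -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_replica": {
        "index": "denorm",
        "shard": 14,
        "node": "platform1"
      }
    }
  ]
}
'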

For more background and detailed reading, follow this and this link.

Upvotes: 3
