toadjaune

Reputation: 881

Elasticsearch _search query always runs on every index

I'm having an issue with a Kibana dashboard, which shows multiple Courier Fetch: xxx of 345 shards failed. warnings every time I reload it.

Okay, I'm asking for data spanning the last 15 minutes, and I have an index per day. There is no way today's index contains 345 shards, so why does my query span so many of them?
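
For reference, the shards behind that count can be listed with the _cat API. This is just a sketch of the general idea, not actual output or commands from my cluster:

GET _cat/shards/*_logs?v

That lists every shard of every *_logs index, along with the node it lives on.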


Things I have checked:

Restricting the search to today's index1:

GET 20181027_logs/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "timestamp": {
              "gte": 1543326215000,
              "lte": 1543329815000,
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  }
}

Response (truncated):

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1557,

The same query without restricting the index:

GET *_logs/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "timestamp": {
              "gte": 1543326215000,
              "lte": 1543329815000,
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  }
}

Response (truncated):

{
  "took": 24,
  "timed_out": false,
  "_shards": {
    "total": 345,
    "successful": 345,
    "failed": 0
  },
  "hits": {
    "total": 1557,

We can see that the second query returns exactly the same results as the first one, but searches through every index.

The timestamp field is indexed, as shown by the mapping of today's index:

GET 20181027_logs/_mapping

{
  "20181027_logs": {
    "mappings": {
      "logs": {
        "properties": {
          […]
          "timestamp": {
            "type": "date"
          }
          […]

While a non-indexed field would give2:

           "timestamp": {
             "type": "date",
             "index": false
           }

Remaining leads

At this point, I really have no idea what the issue could be.

Just as a side note: the timestamp field is not the insertion date of the event, but the date at which the event actually happened. Regardless of this timestamp, events are always inserted into the latest index. This means that every index can contain events with past dates, but none with future dates.

In this precise case, I don't see how that could matter: since we're only querying the last 15 minutes, the data can only be in the latest index, no matter what.
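
One way to double-check this would be an aggregation on the _index metafield, which shows which indices actually hold the matching documents (the aggregation name per_index below is just an arbitrary label I made up):

GET *_logs/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": 1543326215000,
        "lte": 1543329815000,
        "format": "epoch_millis"
      }
    }
  },
  "aggs": {
    "per_index": {
      "terms": {
        "field": "_index"
      }
    }
  }
}

With "size": 0, no hits are fetched; the response only contains one bucket per index that has documents in the time range.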

Elasticsearch and Kibana version: 5.4.3

Thanks for reading this far, and any help would be greatly appreciated!


1: There's a mistake in index naming, causing an offset between the index name and the actual corresponding date, but it should not matter here.

2: This was checked on another Elasticsearch cluster of the same version, with some fields explicitly opted out of indexing.

Upvotes: 1

Views: 723

Answers (1)

toadjaune

Reputation: 881

TL;DR

I finally solved the issue simply by reducing the number of shards.

Full disclosure

When using the Dev Tools in Kibana, I could see many errors on the _msearch endpoint:

{
  "shard": 2,
  "index": "20180909_logs",
  "node": "FCv8yvbyRhC9EPGLcT_k2w",
  "reason": {
    "type": "es_rejected_execution_exception",
    "reason": "rejected execution of org.elasticsearch.transport.TransportService$7@754fe283 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@16a14433[Running, pool size = 7, active threads = 7, queued tasks = 1000, completed tasks = 16646]]"
  }
},

Which basically proves that I'm flooding my ES server with too many parallel requests on too many shards.
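
A quick way to keep an eye on this is the _cat thread pool API, which reports rejections per node for the search thread pool (the column selection below is just an example):

GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected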

From what I could understand, it's apparently normal for Kibana to query every single index of my index pattern, even if some of them don't contain any fresh data (ES is supposed to query them anyway, and to conclude in almost no time that they hold no matching data, since the timestamp field is indexed).

From there, I had a few options:

1. Reduce the data retention
2. Reduce the number of parallel requests I am doing
3. Add nodes to my cluster
4. Restructure my data to use fewer shards
5. Increase the size of the search queue

1 and 2 are not an option in my case.

Option 5 would probably work, but it is apparently strongly discouraged (from what I could understand, this error is usually only the symptom of deeper issues, which should be fixed instead).
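
For completeness, if you did want to go down that road anyway, the relevant setting is thread_pool.search.queue_size in elasticsearch.yml; it is a static node setting, so the node has to be restarted, and the value below is purely an illustrative example, not a recommendation:

# elasticsearch.yml
thread_pool.search.queue_size: 2000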

This is a 160GB, single-node cluster with (now) more than 350 shards, which makes for an extremely low average shard size. So I decided to first try option 4: reindexing my data to use fewer shards.
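
To get an idea of how small the shards actually were, a _cat request along these lines gives per-index primary shard counts, document counts and on-disk sizes (the column list is illustrative):

GET _cat/indices/*_logs?v&h=index,pri,docs.count,store.size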

How I did it

Use a single shard per index:

I created the following index template:

PUT _template/logs
{
  "template": "*_logs",
  "settings": {
    "number_of_shards": 1
  }
}

Now, all my future indices will have a single shard.

I still need to reindex or merge the existing indices, but this has to be done with the next point anyway.
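
As a sanity check, the template can be fetched back to confirm it is in place (templates only apply to indices created after the template exists, hence the need to reindex the old ones):

GET _template/logs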

Switch to monthly indices (instead of daily)

I modified the code that inserts data into ES to use month-based index names (such as 201901_monthly_logs), and then reindexed every old index into the corresponding one in the new pattern:

POST _reindex
{
  "source": {
    "index": "20181024_logs"
  },
  "dest": {
    "index": "201810_monthly_logs"
  }
}
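
For bigger indices, the same reindex can be run in the background and monitored through the task management API; a sketch, using the same example index names:

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "20181024_logs"
  },
  "dest": {
    "index": "201810_monthly_logs"
  }
}

GET _tasks?detailed=true&actions=*reindex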

Enjoy!

This being done, I was down to 7 indices (and 7 shards as well). All that was left was changing the index pattern from _logs to _monthly_logs in my Kibana visualisations.

I haven't had any issues since. I'll just wait a bit longer, then delete my old indices.
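
When that time comes, the old daily indices can be dropped individually (example name only; I'd avoid a wildcard like *_logs here, since the new monthly indices also end in _logs and would match it):

DELETE 20181024_logs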

Upvotes: 2
