SereneAtk

Reputation: 115

Elasticsearch with lots of data and paging

I am querying data stored in Elasticsearch.

When I request a page beyond the 10,000th element, I get this exception:

RestStatusException{status=400} org.springframework.data.elasticsearch.RestStatusException: Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]; nested exception is ElasticsearchStatusException[Elasticsearch exception [type=search_phase_execution_exception, reason=all shards failed]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Result window is too large, from + size must be less than or equal to: [10000] but was [12050]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.]]; nested: ElasticsearchException[Elasticsearch exception [type=illegal_argument_exception, reason=Result window is too large, from + size must be less than or equal to: [10000] but was [12550]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.]];\n\tat deployment.ROOT.war//org.springframework.data.elasticsearch.core.ElasticsearchExceptionTranslator.translateExceptionIfPossible(ElasticsearchExceptionTranslator.java:69)\n\tat deployment.ROOT.wa

I've found this: https://discuss.elastic.co/t/result-window-is-too-large/319979 and it mentions https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#search-after

I thought I could do something similar, but I am not sure how. My query, including the sort part, is:

GET /cars/_search/
{
  "from": 1001,
  "size": 50,
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    },
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}

In the response, each hit has a sort block which looks like this:

{
  "took" : 11,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "hits" : [
      {
        "_index" : "cars",
        "_type" : "_doc",
        "_id" : "1239332",
        "_score" : 1.0,
        "_source" : {
        ...},
        "sort" : [
          1.0,
          1.0
        ]
      }
    ]
  }
}

In the documentation example, search_after is given the sort values of the last hit:

GET twitter/_search
{
    "query": {
        "match": {
            "title": "elasticsearch"
        }
    },
    "search_after": [1463538857, "654323"],
    "sort": [
        {"date": "asc"},
        {"tie_breaker_id": "asc"}
    ]
}

I don't have any field I want to sort by; scoring is the only valid way to sort, so what should I put in the sort part of the query?

In my example, the 1.0 in the sort block of the results does not seem to help me. It is not a unique value, so search_after will probably fail: how would Elasticsearch know where to resume from a score of 1.0? What if all documents are scored 1.0?

Upvotes: 0

Views: 518

Answers (1)

Val

Reputation: 217464

You can simply use the tie_breaker_id field, which is a copy of the _id field with doc values enabled. That's going to do the job.

"sort": [
    { "_score": "desc" },
    {"tie_breaker_id": "asc"}
]
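To see why the second sort key matters when every document scores 1.0, here is a small in-memory sketch in Python. It only imitates the search_after contract (return the hits that sort strictly after the given values); it does not talk to Elasticsearch, and the helper names and sample documents are made up for illustration.

```python
def sort_key(doc):
    # Mirrors the sort above: _score descending, then tie_breaker_id ascending.
    return (-doc["_score"], doc["tie_breaker_id"])


def search_after_page(docs, after, size):
    """Return up to `size` docs that sort strictly after the `after` values."""
    ordered = sorted(docs, key=sort_key)
    if after is not None:
        after_key = (-after[0], after[1])
        ordered = [d for d in ordered if sort_key(d) > after_key]
    return ordered[:size]


def collect_all(docs, size):
    """Page through all docs, feeding the last hit's sort values into the next call."""
    seen, after = [], None
    while True:
        page = search_after_page(docs, after, size)
        if not page:
            return seen
        seen.extend(d["tie_breaker_id"] for d in page)
        last = page[-1]
        after = [last["_score"], last["tie_breaker_id"]]


# Hypothetical data: seven documents that all have the same score.
docs = [{"_score": 1.0, "tie_breaker_id": f"doc-{i:02d}"} for i in range(7)]
all_ids = collect_all(docs, size=3)
print(all_ids)
```

Because the tie breaker is unique per document, the pair (score, tie breaker) identifies exactly one position in the ordering, so each document comes back once even though the scores are identical.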

And if you use PIT (point in time), another tie-breaker field called _shard_doc is added automatically for you. For the record, PIT is useful if you want the result set to be "frozen" while paginating, i.e. immune to new documents coming in while pagination is in progress.

Note: tie_breaker_id is only available from 8.4 onwards. Use _shard_doc in earlier versions.
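As a rough sketch of how the follow-up requests could be built, the Python helper below assembles a search body from the sort array of the last hit on the previous page. The function name, the "1m" keep_alive value and the PIT id placeholder are assumptions for illustration; the body only uses the documented sort, search_after and pit keys, and it deliberately omits from, which is not combined with search_after.

```python
import json


def next_page_body(last_hit, size, pit_id=None):
    """Build a search body for the page that follows `last_hit` (None for the first page)."""
    body = {
        "size": size,
        "query": {"match_all": {}},
        "sort": [{"_score": "desc"}, {"tie_breaker_id": "asc"}],
    }
    if pit_id is not None:
        # With a point in time, _shard_doc can act as the tie breaker instead.
        body["pit"] = {"id": pit_id, "keep_alive": "1m"}  # keep_alive value is an assumption
        body["sort"] = [{"_score": "desc"}, {"_shard_doc": "asc"}]
    if last_hit is not None:
        # Pass the previous page's last sort values back unchanged.
        body["search_after"] = last_hit["sort"]
    return body


# Hypothetical last hit from the previous page, shaped like the hits in the question.
last_hit = {"_id": "1239332", "_score": 1.0, "sort": [1.0, "1239332"]}
body = next_page_body(last_hit, size=50)
print(json.dumps(body, indent=2))
```

The printed body can be sent to the same _search endpoint as before; when a PIT id is used, the request goes to /_search without an index name because the PIT already identifies the index.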

Upvotes: 1
