Reputation: 1445
I am using from/size pagination to iterate over a large, unsorted query result set while concurrently indexing documents that are not part of the query result set. Ignoring the fact that scroll/scan would be a more efficient solution for my scenario, can I expect consistent results?
I understand that if I were concurrently indexing documents that were part of the result set I should expect duplicate and missing results. In this scenario I am indexing documents that are not part of the result set and I am not sure if the inconsistent results I am getting are expected behavior due to this paging strategy.
I am using Elasticsearch version 1.2.2. I have verified that the construction of the queries is consistent with the documentation:
{
    "from" : 0, "size" : 50000,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}
-
{
    "from" : 50000, "size" : 50000,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}
The correct number of documents is always returned (about 2.6 million), but most of the time a small number of them (about 10) are duplicates that take the place of the correct documents.
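For reference, this is roughly how the duplicates can be counted. It is only a sketch: it assumes each hit carries the usual `_id` field from the search response, and the helper name `find_duplicate_ids` and the toy pages are made up for illustration.

```python
from collections import Counter


def find_duplicate_ids(pages):
    """Return {_id: count} for every document id seen more than once across pages."""
    counts = Counter(hit["_id"] for page in pages for hit in page)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Toy data standing in for the hits of two consecutive from/size pages.
page_1 = [{"_id": "a"}, {"_id": "b"}]
page_2 = [{"_id": "b"}, {"_id": "c"}]
duplicates = find_duplicate_ids([page_1, page_2])
```

Since the total count stays correct, every id reported here implies that some other document was skipped.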
Upvotes: 0
Views: 829
Reputation: 2470
Deep pagination can't be expected to be safe, as it is usually executed across multiple shards (https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html). So even when you are not indexing at the same time (which would break your pagination for sure), segments within the shards are merged in the background from time to time. When this happens you may lose a document and get a duplicate in its place.
So: do scroll/scan.
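A minimal sketch of what that could look like in Python. It assumes a client object exposing `search(...)` and `scroll(...)` in the style of the official elasticsearch-py client of the 1.x era; the function name `scan_all`, the batch size, and the index name in the usage comment are placeholders, not something taken from the question.

```python
def scan_all(client, index, query, scroll="1m", size=1000):
    """Yield every hit for `query` using a scan-type scroll."""
    # Open the scroll. With search_type="scan" the first response normally
    # carries only a scroll id; any hits it does contain are yielded as well.
    # Note that in scan mode `size` applies per shard, not per request.
    resp = client.search(index=index, body={"query": query},
                         search_type="scan", scroll=scroll, size=size)
    for hit in resp["hits"]["hits"]:
        yield hit
    scroll_id = resp["_scroll_id"]

    # Keep pulling batches until an empty one signals the end of the results.
    while True:
        resp = client.scroll(scroll_id=scroll_id, scroll=scroll)
        hits = resp["hits"]["hits"]
        if not hits:
            break
        for hit in hits:
            yield hit
        # Always continue with the most recent scroll id.
        scroll_id = resp["_scroll_id"]


# Usage (assumes elasticsearch-py is installed; "my_index" is a placeholder):
# from elasticsearch import Elasticsearch
# es = Elasticsearch()
# for hit in scan_all(es, "my_index", {"term": {"user": "kimchy"}}):
#     print(hit["_id"])
```

The scroll works against a snapshot taken when it is opened, so documents indexed while you iterate do not shift the pages around.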
Upvotes: 1
Reputation: 1445
The issue of inconsistent results can be resolved using scroll/scan pagination instead of from/size pagination.
I do not know for sure whether my use of from/size paging is supported, but the getting-started documentation seems to suggest that it is. This may indicate a bug in from/size paging in version 1.2.2 of Elasticsearch, though I have not done the testing needed to identify or verify it.
Upvotes: 0