Elastic Search Scroll Behaviour

Question

I came across the scroll functionality in Elasticsearch and it looks pretty interesting. I have gone through a lot of documentation, but the questions below are still not clear to me.

If offset is already there, then why use scroll?
What about upcoming records? Suppose it has finished scrolling through all the data and a few seconds later new data arrives in the index. How will that work? Will it keep scrolling to get the new records as well, like streaming?
Suppose the connection breaks because of server load or a network issue. Will it start scrolling the data from the beginning?

All these questions are in the context of re-indexing data from an old index to a new index.

pandaadb · Accepted Answer

I will try to give some info on this, as I too have recently done some research into it:

If offset is already there, then why use scroll?

I am not sure whether you can use scroll in combination with offsets. But I believe the main difference is that an offset query can give you "false" results. False in the sense that it executes your query correctly, but each request sees every update made in between. For reindexing this would be wrong, as you risk losing data. Imagine you run an offset query for 10k results and then take 2 minutes to process them. Your documents might be updated, inserted or deleted within those 2 minutes. That means that offsetting your next query by 10k might skip a few documents (imagine a deletion in between) or return a document you have already seen (imagine an insert in between). Scroll, however, keeps the search context alive and returns results in a clear and strict order, in which no later updates are considered.
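
To make the skipping problem concrete, here is a small pure-Python simulation (no Elasticsearch involved). The list stands in for an index, and the helper names are made up for this illustration. It only sketches the idea of offset paging against a changing data set versus paging over a point-in-time copy, assuming a deletion happens between the first and second request.

```python
def page_with_offset(index, page_size, mutate_after_first_page):
    """Read all pages by offset from a live list that may change between requests."""
    seen, offset = [], 0
    while offset < len(index):
        seen.extend(index[offset:offset + page_size])
        offset += page_size
        if offset == page_size:
            # A write lands between the first and the second request.
            mutate_after_first_page(index)
    return seen


def page_with_snapshot(index, page_size, mutate_after_first_page):
    """Read all pages from a frozen copy, roughly what a scroll context gives you."""
    frozen = list(index)  # point-in-time copy taken on the first request
    seen, offset = [], 0
    while offset < len(frozen):
        seen.extend(frozen[offset:offset + page_size])
        offset += page_size
        if offset == page_size:
            # The live index still changes, but the frozen copy does not.
            mutate_after_first_page(index)
    return seen


def delete_first(index):
    """Simulated concurrent deletion of the first document."""
    del index[0]


docs = ["a", "b", "c", "d", "e", "f"]
offset_result = page_with_offset(list(docs), 2, delete_first)
snapshot_result = page_with_snapshot(list(docs), 2, delete_first)
print(offset_result)    # "c" is missing: it slid into the page already read
print(snapshot_result)  # all six documents
```

With offset paging, the deletion shifts every later document one position forward, so "c" ends up inside a page that was already fetched and is never returned.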

I think the required behaviour could also be implemented with a constant sort order plus search_after, as documented here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-after.html This should make the results stable (in the sense that the cursor always points to the correct position), but I think it would still see all changes that happen between two requests.
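
A rough sketch of what the search_after request bodies could look like, written as plain Python dicts so it runs without a cluster. The sort fields created_at and doc_id are hypothetical; any sort works as long as it ends in a unique tiebreaker field. The helper names are mine, not part of any client library.

```python
def search_after_body(page_size, last_sort_values=None):
    """Build a search request body that continues after the given sort values."""
    body = {
        "size": page_size,
        "query": {"match_all": {}},
        # Hypothetical fields: a timestamp plus a unique tiebreaker.
        "sort": [{"created_at": "asc"}, {"doc_id": "asc"}],
    }
    if last_sort_values is not None:
        # One value per sort clause, copied from the last hit of the previous page.
        body["search_after"] = list(last_sort_values)
    return body


def next_cursor(response):
    """Return the sort values of the last hit, or None when the page is empty."""
    hits = response["hits"]["hits"]
    return hits[-1]["sort"] if hits else None
```

The first request is sent without search_after. Every following request passes the sort values of the previous page's last hit, so the position is tied to a document rather than to a numeric offset.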

I would imagine re-indexing works by changing your config (say, Logstash) to start inserting new documents into the new index, and then doing a scroll over ALL the old data to reindex it into the new index. By using scroll, you can keep working with the old data while those changes do not affect your reindex operation.

Docs:

While a search request returns a single “page” of results, the scroll API can be used to retrieve large numbers of results (or even all results) from a single search request, in much the same way as you would use a cursor on a traditional database.
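
For the reindexing idea above, each batch of scroll hits has to be written to the new index somehow. One way is the bulk API, which takes newline-delimited JSON: an action line followed by the document source. Below is a minimal sketch that only builds that payload. The function name and the target index name are assumptions for illustration, and sending the payload to the _bulk endpoint is left out.

```python
import json


def bulk_index_payload(hits, target_index):
    """Turn one batch of scroll hits into a newline-delimited bulk request body."""
    lines = []
    for hit in hits:
        # Action line: index this document into the target index under its old id.
        lines.append(json.dumps({"index": {"_index": target_index, "_id": hit["_id"]}}))
        # Source line: the document body exactly as the old index returned it.
        lines.append(json.dumps(hit["_source"]))
    if not lines:
        return ""
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"
```

Keeping the original _id means that re-running a batch overwrites the same documents instead of creating duplicates.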

Up next:

What about upcoming records? Suppose it has finished scrolling through all the data and a few seconds later new data arrives in the index. How will that work? Will it keep scrolling to get the new records as well, like streaming?

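Scrolling preserves the result set it created on the first scroll request. It does this by taking a snapshot of the index at that point, and later changes are not visible to that particular scroll. So it does not behave like a stream: documents indexed after the initial request are not returned by it. Docs: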

The results that are returned from a scroll request reflect the state of the index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.

And third:

Suppose the connection breaks because of server load or a network issue. Will it start scrolling the data from the beginning?

This does not matter, as long as the search context has not expired. Scroll comes with a keep-alive parameter, e.g. POST /twitter/tweet/_search?scroll=1m, where the value, 1m, tells Elasticsearch how long the search context is kept alive within the ES server. This means that if your connection breaks, all you need to do is take your scroll id and use it in a new request. ES will match that id to the existing search context and give you the expected results. Docs:

In order to use scrolling, the initial search request should specify the scroll parameter in the query string, which tells Elasticsearch how long it should keep the “search context” alive (see Keeping the search context alive), eg ?scroll=1m.
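
Here is a sketch of a scroll loop that picks up again with the same scroll id after a dropped connection. It assumes a hypothetical send(method, path, body) function that performs the HTTP call, returns the parsed JSON response and raises ConnectionError on a network failure. That transport, the function name and the retry settings are my own assumptions, not part of Elasticsearch or its clients. The request paths and the _scroll_id and hits.hits response fields follow the scroll documentation.

```python
import time


def scroll_all(send, index, keep_alive="1m", page_size=1000,
               max_retries=3, backoff_seconds=1.0):
    """Yield every hit of a match_all scroll, retrying a failed scroll call."""

    def continue_scroll(scroll_id):
        for attempt in range(max_retries + 1):
            try:
                # The same scroll id is reused, so the server-side context is kept.
                return send("POST", "/_search/scroll",
                            {"scroll": keep_alive, "scroll_id": scroll_id})
            except ConnectionError:
                if attempt == max_retries:
                    raise
                time.sleep(backoff_seconds)

    # The initial search opens the search context and returns the first batch.
    response = send("POST", f"/{index}/_search?scroll={keep_alive}",
                    {"size": page_size, "query": {"match_all": {}}})
    scroll_id = response["_scroll_id"]
    try:
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            response = continue_scroll(scroll_id)
            scroll_id = response["_scroll_id"]  # always use the most recent id
    finally:
        # Free the search context instead of waiting for the keep-alive to expire.
        send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

Two things to keep in mind. The retries plus the backoff have to fit inside the keep-alive window, otherwise the context is gone and the scroll has to start over. And, as far as I understand it, if the connection drops after the server has already produced the next batch, the retry may return the batch after that one, so for a reindex it is worth checking document counts at the end.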

Generally, all that information can be found here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html

Hope this helps,

Artur
