improbable

Reputation: 2934

Scroll finalizes successfully without scrolling through all records in ES

I have a scroll based on the excerpt from these docs:

SearchQuery searchQuery = new NativeSearchQueryBuilder()
    .withQuery(matchAllQuery())
    .withIndices(INDEX_NAME)
    .withTypes(TYPE_NAME)
    .withFields("message")
    .withPageable(PageRequest.of(0, 10))
    .build();

// open the scroll context (keep-alive of 1000 ms)
Page<SampleEntity> scroll = elasticsearchTemplate.startScroll(1000, searchQuery, SampleEntity.class);

String scrollId = ((ScrolledPage<SampleEntity>) scroll).getScrollId();
List<SampleEntity> sampleEntities = new ArrayList<>();
while (scroll.hasContent()) {
    sampleEntities.addAll(scroll.getContent());
    scrollId = ((ScrolledPage<SampleEntity>) scroll).getScrollId();
    // fetch the next page under the same keep-alive
    scroll = elasticsearchTemplate.continueScroll(scrollId, 1000, SampleEntity.class);
}
// release the scroll context on the server
elasticsearchTemplate.clearScroll(scrollId);

I need to scroll through a huge dataset (more than 100 million records).

My scroll code looks exactly like this excerpt (only the query and the entity classes differ). However, in startScroll and continueScroll I pass a different, custom document class that has far fewer fields than the document class used for indexing; my query filters the source so that only the few fields matching this slimmer class are returned.
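For illustration, the slim projection class looks roughly like this (class, index, and field names here are made up, not my real ones; it maps only the fields requested via withFields, not the full indexed document):

```java
import org.springframework.data.annotation.Id;
import org.springframework.data.elasticsearch.annotations.Document;

// Illustrative projection entity passed to startScroll/continueScroll.
// It deliberately declares only the fields the scroll query returns.
@Document(indexName = "sample-index", type = "sample-type")
public class SlimEntity {

    @Id
    private String id;

    private String message; // the only payload field fetched by the scroll

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getMessage() { return message; }
    public void setMessage(String message) { this.message = message; }
}
```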

The scroll's getTotalElements() method returns the correct total number of elements to be fetched.

The scroll loop finishes successfully, but it only scrolls through about 6% of the dataset.

Upvotes: 0

Views: 769

Answers (1)

Abacus

Reputation: 19431

Not a real solution, but looking at your code: you are building a List<SampleEntity> that will eventually contain all your SampleEntity instances. If each of these entities uses only 256 bytes, then with more than 100 million of them that would be at least 25 GB of memory. How much memory do you have available?
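As a back-of-the-envelope check (plain Java; 256 bytes per entity is just the rough assumption from above, real object sizes will differ):

```java
// Rough estimate of the memory needed to buffer all hits in one List.
public class MemoryEstimate {

    // total bytes = number of entities * assumed bytes per entity
    public static long estimateBytes(long entityCount, long bytesPerEntity) {
        return entityCount * bytesPerEntity;
    }

    public static void main(String[] args) {
        long bytes = estimateBytes(100_000_000L, 256L);
        double gib = bytes / (1024.0 * 1024.0 * 1024.0);
        // 25_600_000_000 bytes, i.e. about 25.6 GB (decimal) or ~23.8 GiB
        System.out.printf("%d bytes ~= %.1f GiB%n", bytes, gib);
    }
}
```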

As for the server logs: judging by the garbage-collection messages, Elasticsearch seems to be using up all of its memory as well. How much heap have you configured on these machines?

Edit 25.07.2020:

I have set up a test program with the following setup:

  • local Elasticsearch 6.4.2
  • Spring Data Elasticsearch 3.1.0 using transport client

I created an index with 25 million entries, each object having a Long id and one UUID stored as a String. This put some strain on Elasticsearch's garbage collection, but it completed.

Reading all entries (not aggregating the results in a list, but counting the returned records) with a matchAll query and a pageable of size 1000 ran without problems.

Using no pageable in the request (which results in a scroll request size of 10) put a heavy load on Elasticsearch and even more on the client process. This request also finished without problems, although it ran very, very, very slowly, causing heavy garbage collection in both Elasticsearch and the client program.

So it seems that the code is correct; the problem lies in the page size. The default of 10 is far too small. I ran it successfully with sizes up to 10000 (the maximum you can use without increasing index.max_result_window).

So you could try increasing the size of the pageable, but keep in mind that collecting all of these elements puts a heavy load on your application's memory consumption.
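A sketch of what my test did, against the same 3.1 scroll API (index name and the SampleEntity class are placeholders; counting instead of collecting keeps the client's memory use flat):

```java
import org.elasticsearch.index.query.QueryBuilders;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.elasticsearch.core.ElasticsearchTemplate;
import org.springframework.data.elasticsearch.core.ScrolledPage;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
import org.springframework.data.elasticsearch.core.query.SearchQuery;

public class ScrollCount {

    // Counts all hits instead of buffering them in a List.
    // Page size raised to 10_000, the default index.max_result_window ceiling.
    public static long countAll(ElasticsearchTemplate template) {
        SearchQuery query = new NativeSearchQueryBuilder()
            .withQuery(QueryBuilders.matchAllQuery())
            .withIndices("sample-index")              // placeholder index name
            .withPageable(PageRequest.of(0, 10_000))
            .build();

        // 60_000 ms keep-alive per scroll request
        ScrolledPage<SampleEntity> page =
            (ScrolledPage<SampleEntity>) template.startScroll(60_000, query, SampleEntity.class);
        long count = 0;
        String scrollId = page.getScrollId();
        try {
            while (page.hasContent()) {
                count += page.getNumberOfElements(); // count, don't collect
                scrollId = page.getScrollId();
                page = (ScrolledPage<SampleEntity>) template
                    .continueScroll(scrollId, 60_000, SampleEntity.class);
            }
        } finally {
            template.clearScroll(scrollId);          // always free the context
        }
        return count;
    }
}
```

This needs a running cluster and the Spring Data Elasticsearch 3.1 jars, so treat it as a template rather than a drop-in snippet.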

Upvotes: 1
