Tom1999

Reputation: 193

Result window is too large

My friend stored 65000 documents in an Elasticsearch cloud deployment and I would like to retrieve all of them (using Python). However, when I run my current script, I get this error:

RequestError(400, 'search_phase_execution_exception', 'Result window is too large, from + size must be less than or equal to: [10000] but was [30000]. See the scroll api for a more efficient way to request large data sets. This limit can be set by changing the [index.max_result_window] index level setting.') 

My script

 from elasticsearch import Elasticsearch

 es = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))
 docs = es.search(body={"query": {"match_all": {}}, "_source": ["_id"], "size": 65000})

What would be the easiest way to retrieve all those documents instead of being limited to 10000? Thanks.

Upvotes: 3

Views: 9133

Answers (4)

Sebastian

Reputation: 370

As others said, this limit protects your nodes from large result sets. The standard way to page through large result sets now (in 2024) is to use search_after.

Here's an example using Python and elasticsearch_dsl:

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

# Reuse the cloud_id / credentials from the question
client = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))

def hit_generator(index, chunk_size=5000):
    i = 0
    search_after_id = None
    while True:
        print(f'Fetching next {chunk_size} documents, fetched {i * chunk_size} so far...')
        s = Search(using=client, index=index)
        s = s.extra(size=chunk_size)
        # Sort on a unique field so search_after can resume deterministically
        s = s.sort('_id')
        if search_after_id:
            s = s.extra(search_after=[search_after_id])

        response = s.execute()
        if len(response) == 0:
            print(f'No more results to return for index {index}, scanned <{i * chunk_size} documents')
            break
        for hit in response:
            # Remember the last hit's sort value for the next page
            search_after_id = hit.meta.id
            yield hit
        i += 1

# How to use it?
for hit in hit_generator('my_index'):
    print(f'Got hit: {hit}')

This way you'll only read the next 5000 documents at a time, starting from the one where you 'finished' in the previous request.

It will execute (number of documents) / (chunk_size) searches in total.

Upvotes: 1

Amit

Reputation: 32386

The error message itself tells you how to solve the issue; look carefully at this part of it:

This limit can be set by changing the [index.max_result_window] index level setting.

Please refer to the update index settings API on how to change that.

So for your setting it would look like:

PUT /<your-index-name>/_settings
{
    "index" : {
        "max_result_window" : 65000
    }
}

Note that 65000 is equal to the total number of docs in your index.
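If you prefer to do this from Python rather than the REST console, the same setting can be applied through the client's indices API. A minimal sketch, assuming the `es` client from the question and a hypothetical index name:

from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))

# Raise the window so a single search may return up to 65000 hits.
# Keep in mind this increases the memory cost of deep requests on the nodes.
es.indices.put_settings(
    index="my-index",  # hypothetical index name
    body={"index": {"max_result_window": 65000}},
)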

Upvotes: 2

pkp9999

Reputation: 159

The limit has been set so that the result set does not overwhelm your nodes. Results occupy memory in the Elasticsearch node, so the bigger the result set, the bigger the memory footprint and the impact on the nodes.

Depending on what you want to do with the retrieved documents, either raise index.max_result_window (at the memory cost above) or paginate with the scroll or search_after approaches described in the other answers.

Upvotes: 4

sgalinma

Reputation: 200

You should use the scroll API and fetch the results across several calls. Each scroll request returns at most 10000 results, which stay available for the amount of time you indicate in the call, and you can then paginate through the rest of the result set using the returned scroll_id.
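For reference, here is a minimal sketch of that pattern with the plain Python client, assuming the `es` client from the question and a hypothetical index name `my-index`:

from elasticsearch import Elasticsearch

es = Elasticsearch(cloud_id=cloud_id, http_auth=(username, password))

# Open a scroll context that stays alive for 2 minutes between calls
resp = es.search(
    index="my-index",  # hypothetical index name
    body={"query": {"match_all": {}}},
    scroll="2m",
    size=5000,
)

hits = resp["hits"]["hits"]
while hits:
    for hit in hits:
        print(hit["_id"])
    # Fetch the next page using the scroll_id from the previous response
    resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="2m")
    hits = resp["hits"]["hits"]

# Release the server-side scroll context when done
es.clear_scroll(scroll_id=resp["_scroll_id"])

The helpers.scan() wrapper in elasticsearch-py runs this same loop for you if you'd rather not manage the scroll_id yourself.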

Upvotes: 2
