Tomas Greif

Reputation: 22661

Fetch all records from Elasticsearch

I am trying to fetch and process all entries from Elasticsearch using the elasticsearch Python client. There are approx. 60M records, and the issue I have is that when I increase the size above 1M it starts returning nothing.

from elasticsearch import Elasticsearch

es = Elasticsearch("1.1.1.1:1234")

res = es.search(body={
  "from": 0,
  "size": 10000,  # from + size cannot exceed index.max_result_window (10,000 by default)
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "_exists_:my_string",
            "fields": []
          }
        }
      ],
      "filter": [
        {
          "bool": {
            "must": [
              {
                "range": {
                  "timestamp": {
                    "from": "2019-11-01 01:45:00.000",
                    "to": "2019-11-05 07:45:00.300",
                  }
                }
              }
            ]
          }
        }
      ]
    }
  }
})


# note: on Elasticsearch 7.x hits.total is an object, so use res['hits']['total']['value']
print("%d documents found" % res['hits']['total'])

I want to convert the results (basically JSON) to a pandas data frame. This works well, but I am struggling with how to either fetch all records at once or do this in iterations.

Upvotes: 0

Views: 675

Answers (1)

Archit Saxena

Reputation: 1557

Pagination is a very costly process in distributed systems like Elasticsearch. There is a limit on size + from, set to 10,000 by default via the index.max_result_window setting. To fetch all records for processing, you can use the Scroll API.

https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-request-scroll.html

It takes a snapshot of the index at a point in time and returns a scroll ID which you keep passing in your subsequent requests to fetch the next batch.
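A minimal sketch of that approach, assuming the elasticsearch-py client and pandas (the cluster address and query are taken from the question). helpers.scan is the client's convenience wrapper around the Scroll API:

from elasticsearch import Elasticsearch, helpers
import pandas as pd

es = Elasticsearch("1.1.1.1:1234")

# Same query as in the question, without "from"/"size" -- the scroll
# handles batching. The inner bool/must wrapper around the range
# filter was redundant, so the filter is inlined here.
query = {
  "query": {
    "bool": {
      "must": [
        {"query_string": {"query": "_exists_:my_string", "fields": []}}
      ],
      "filter": [
        {
          "range": {
            "timestamp": {
              "from": "2019-11-01 01:45:00.000",
              "to": "2019-11-05 07:45:00.300"
            }
          }
        }
      ]
    }
  }
}

# helpers.scan opens a scroll context and yields every matching
# document, requesting the next batch behind the scenes.
hits = helpers.scan(
    es,
    query=query,
    scroll="5m",   # how long each scroll context stays alive between batches
    size=10000     # documents per batch, not a cap on the total
)

df = pd.DataFrame(hit["_source"] for hit in hits)
print("%d documents found" % len(df))

Keep in mind that materialising all ~60M hits into a single data frame may not fit in memory; if it does not, consume the generator in chunks and process each chunk separately instead of building one data frame.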

Upvotes: 1
