Reputation: 22661
I am trying to fetch and process all entries in Elasticsearch using the elasticsearch Python client. There are approx. 60M records, and the issue I have is that when I increase the size above 1M it starts returning nothing.
from elasticsearch import Elasticsearch

es = Elasticsearch("1.1.1.1:1234")
res = es.search(body={
    "from": 0,
    "size": 10000,
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "query": "_exists_:my_string",
                        "fields": []
                    }
                }
            ],
            "filter": [
                {
                    "bool": {
                        "must": [
                            {
                                "range": {
                                    "timestamp": {
                                        "from": "2019-11-01 01:45:00.000",
                                        "to": "2019-11-05 07:45:00.300"
                                    }
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
})
print("%d documents found" % res['hits']['total'])
I want to convert the results (basically JSON) to a pandas data frame. That part works well, but I am struggling with how to either fetch all records at once or fetch them in iterations.
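For reference, this is roughly how I build the data frame from a single response today (a minimal sketch; the field names come from my own mapping):

    import pandas as pd

    # each hit's _source holds the original indexed document
    rows = [hit["_source"] for hit in res["hits"]["hits"]]
    df = pd.DataFrame(rows)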
Upvotes: 0
Views: 675
Reputation: 1557
Deep pagination is a very costly operation in a distributed system like Elasticsearch. For that reason, from + size is capped at 10,000 by default (the index.max_result_window setting). To fetch all records for processing, use the Scroll API instead.
https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-request-scroll.html
It takes a point-in-time snapshot of the index and returns a scroll ID, which you pass in each subsequent request to fetch the next batch.
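The Python client ships a scan helper that drives the scroll loop for you and yields one hit at a time. Here is a minimal sketch under a few assumptions: the index name "my-index" is a placeholder for yours, and I dropped from/size and the redundant inner bool wrapper from your query, since scrolling handles the paging:

    from elasticsearch import Elasticsearch, helpers
    import pandas as pd

    es = Elasticsearch("1.1.1.1:1234")

    # same query as in the question, minus from/size
    query = {
        "query": {
            "bool": {
                "must": [
                    {"query_string": {"query": "_exists_:my_string"}}
                ],
                "filter": [
                    {
                        "range": {
                            "timestamp": {
                                "from": "2019-11-01 01:45:00.000",
                                "to": "2019-11-05 07:45:00.300"
                            }
                        }
                    }
                ]
            }
        }
    }

    # helpers.scan wraps the Scroll API: it keeps the scroll context
    # alive ("scroll" is the keep-alive per batch) and fetches "size"
    # documents per request; "my-index" is a placeholder
    rows = [hit["_source"]
            for hit in helpers.scan(es, query=query, index="my-index",
                                    scroll="5m", size=10000)]
    df = pd.DataFrame(rows)
    print("%d documents fetched" % len(df))

With 60M documents you may not want everything in one data frame at once; since helpers.scan returns a generator, you can also consume it in chunks and process or persist each chunk before fetching the next.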
Upvotes: 1