Reputation: 965
Example query:
GET hostname:port /myIndex/_search {
"size": 10000,
"query": {
"term": { "field": "myField" }
}
}
I have been using the size option knowing that:
index.max_result_window = 100000
But if my query needs, say, 650,000 documents or even more, how can I retrieve all of the results in one GET?
I have been reading about SCROLL, FROM-TO, and the PAGINATION API, but none of them ever delivers more than 10K.
This is the example from the Elasticsearch forum that I have been using:
GET /_search?scroll=1m
Can anybody provide an example where you can retrieve all the documents for a GET search query?
Upvotes: 84
Views: 212580
Reputation: 367
For anyone still asking this question: you need to increase max_result_window in your index settings, which defaults to 10k.
curl -X PUT "http://localhost:9200/index_name/_settings" -H "Content-Type: application/json" -d '{
"index" : {
"max_result_window" : 50000
}
}'
Just make sure you have enough memory to handle results when you get them.
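For reference, the same settings update can be made from Python with the requests library; this is just a minimal sketch mirroring the curl call above (host and index name are placeholders):
import requests

resp = requests.put(
    "http://localhost:9200/index_name/_settings",
    json={"index": {"max_result_window": 50000}},  # raise the window to 50000
)
print(resp.json())  # expect {"acknowledged": true}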
Upvotes: 5
Reputation: 61
By default only a lower bound of the total hit count is reported, since that is sufficient for many searches. To explicitly tell Elasticsearch to accurately count (and therefore visit) every matching entry, you can add "track_total_hits": true
to your query. Keep in mind that this is more expensive than the default behavior.
Your query would then look like this:
GET hostname:port /myIndex/_search {
"size": 10000,
"track_total_hits": true,
"query": {
"term": { "field": "myField" }
}
}
Upvotes: 2
Reputation: 2821
First raise the index's max_result_window (example of 30000 documents):
PUT indexname/_settings
{
  "index.max_result_window": 30000
}
Then page through the results with from and size:
res = elastic_client.search(index=index_bu, request_timeout=10,
                            body={
                                "from": 0,       # offset of the first document to return
                                "size": 15000,   # how many documents per page
                                "query": {"match_all": {}}
                            })
For the next page, use "from": 15000, "size": 15000, and so on.
Upvotes: 2
Reputation: 625
You can use scroll
to retrieve more than 10000 records. Below is an example Python function that uses scroll.
# Instance attributes (set elsewhere, e.g. in __init__):
# self._elkUrl = "http://Hostname:9200/logstash-*/_search?scroll=1m"
# self._scrollUrl = "http://Hostname:9200/_search/scroll"
import logging
import sys

import pandas as pd
import requests


def GetDataFromELK(self):
    """
    Get the data from ELK through the scrolling mechanism
    (more than 10000 records in one search).
    Ref: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-scroll.html
    """
    try:
        dataFrame = pd.DataFrame()
        if self._elkUrl is None:
            raise ValueError("_elkUrl is missing")
        if self._username is None:
            raise ValueError("_username for elk is missing")
        if self._password is None:
            raise ValueError("_password for elk is missing")
        # the initial request opens the scroll context and returns the first batch
        response = requests.post(self._elkUrl, json=self.body,
                                 auth=(self._username, self._password))
        response = response.json()
        if response is None:
            raise ValueError("response is missing")
        sid = response['_scroll_id']
        hits = response['hits']
        total = hits["total"]
        if total is None:
            raise ValueError("total hits from ELK is none")
        total_val = int(total['value'])
        # don't lose the first batch that came back with the initial request
        data = hits['hits']
        if len(data) > 0:
            dataFrame = pd.concat([dataFrame, self.DataClean(data)])
        url = self._scrollUrl
        if url is None:
            raise ValueError("scroll url is missing")
        # start scrolling; stop when a scroll response comes back empty
        while total_val > 0:
            # keep the search context alive for 2m
            scroll_query = {"scroll": "2m", "scroll_id": sid}
            response1 = requests.post(url, json=scroll_query,
                                      auth=(self._username, self._password))
            response1 = response1.json()
            # each response includes a scroll_id, which should be passed to the
            # scroll API in order to retrieve the next batch of results
            sid = response1['_scroll_id']
            data = response1['hits']['hits']
            if len(data) > 0:
                cleanDataFrame = self.DataClean(data)
                dataFrame = pd.concat([dataFrame, cleanDataFrame])
            total_val = len(data)
        num = len(dataFrame)
        print('Total records received from ELK =', num)
        return dataFrame
    except Exception as e:
        logging.error('Error while getting the data from elk', exc_info=e)
        sys.exit()
Upvotes: 3
Reputation: 373
The Scroll API has its own limitations. Elasticsearch recently introduced a new feature, Point in Time (PIT).
Basically it takes a snapshot of the index at that moment, and you can then use search_after to retrieve results beyond 10000.
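A minimal sketch of PIT + search_after with the official elasticsearch Python client (8.x-style keyword arguments assumed; the index name, field, and value are placeholders):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# open a point in time on the index
pit = es.open_point_in_time(index="myIndex", keep_alive="1m")
pit_id = pit["id"]

all_hits = []
search_after = None
while True:
    kwargs = dict(
        size=10000,
        query={"term": {"myField": "some_value"}},
        pit={"id": pit_id, "keep_alive": "1m"},
        sort=["_shard_doc"],  # cheap tie-breaker sort for PIT searches
    )
    if search_after is not None:
        kwargs["search_after"] = search_after
    resp = es.search(**kwargs)
    hits = resp["hits"]["hits"]
    if not hits:
        break
    all_hits.extend(hits)
    pit_id = resp["pit_id"]            # the PIT id can change between responses
    search_after = hits[-1]["sort"]    # sort values of the last hit feed the next page

es.close_point_in_time(id=pit_id)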
Upvotes: 1
Reputation: 2480
Node.js scroll example using the elasticsearch client:
const elasticsearch = require('elasticsearch');
const elasticSearchClient = new elasticsearch.Client({ host: 'esURL' });

async function getAllData(query) {
  // initial search opens the scroll context and returns the first batch
  const result = await elasticSearchClient.search({
    index: '*',
    scroll: '10m',
    size: 10000,
    body: query,
  });

  // recursively fetch the next batches until all hits have been collected
  const retriever = async ({
    data,
    total,
    scrollId,
  }) => {
    if (data.length >= total) {
      return data;
    }
    const result = await elasticSearchClient.scroll({
      scroll: '10m',
      scroll_id: scrollId,
    });
    data = [...data, ...result.hits.hits];
    return retriever({
      total,
      scrollId: result._scroll_id,
      data,
    });
  };

  // note: hits.total is assumed to be a number (ES 6.x);
  // on ES 7+ it is an object, so use result.hits.total.value instead
  return retriever({
    total: result.hits.total,
    scrollId: result._scroll_id,
    data: result.hits.hits,
  });
}
Upvotes: 16
Reputation: 12996
PUT _settings
{
"index.max_result_window": 500000
}
Upvotes: 3
Reputation: 217274
Scroll is the way to go if you want to retrieve a high number of documents, high in the sense that it's way over the 10000 default limit, which can be raised.
The first request needs to specify the query you want to make and the scroll
parameter with duration before the search context times out (1 minute in the example below)
POST /index/type/_search?scroll=1m
{
"size": 1000,
"query": {
"match" : {
"title" : "elasticsearch"
}
}
}
In the response to that first call, you get a _scroll_id
that you need to use to make the second call:
POST /_search/scroll
{
"scroll" : "1m",
"scroll_id" : "DXF1ZXJ5QW5kRmV0Y2gBAAAAAAAAAD4WYm9laVYtZndUQlNsdDcwakFMNjU1QQ=="
}
In each subsequent response, you'll get a new _scroll_id
that you need to use for the next call until you've retrieved the amount of documents you need.
So in pseudo code it looks somewhat like this:
# first request
response = request('POST /index/type/_search?scroll=1m')
docs = [ response.hits ]
scroll_id = response._scroll_id
# subsequent requests
while (true) {
    response = request('POST /_search/scroll', scroll_id)
    # stop once a scroll page comes back empty
    if (response.hits is empty) break
    docs.push(response.hits)
    scroll_id = response._scroll_id
}
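A minimal concrete version of that pseudo code, assuming the official elasticsearch Python client (8.x-style keyword arguments) and the same index and query as above:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# first request: opens the scroll context and returns the first page
resp = es.search(
    index="index",
    scroll="1m",
    size=1000,
    query={"match": {"title": "elasticsearch"}},
)
docs = resp["hits"]["hits"]
scroll_id = resp["_scroll_id"]

# subsequent requests: keep scrolling until a page comes back empty
while True:
    resp = es.scroll(scroll_id=scroll_id, scroll="1m")
    hits = resp["hits"]["hits"]
    if not hits:
        break
    docs.extend(hits)
    scroll_id = resp["_scroll_id"]

# free the search context once you're done
es.clear_scroll(scroll_id=scroll_id)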
UPDATE:
Please refer to the following answer which is more accurate regarding the best solution for deep pagination: Elastic Search - Scroll behavior
Upvotes: 76
Reputation: 513
Another option is the search_after parameter. Combined with a sorting mechanism, you can save the sort values of the last element of a page and then ask for results coming after that last element.
GET twitter/_search
{
"size": 10,
"query": {
"match" : {
"title" : "elasticsearch"
}
},
"search_after": [1463538857, "654323"],
"sort": [
{"date": "asc"},
{"_id": "desc"}
]
}
Worked for me. But even so, getting more than 10,000 documents is really not easy.
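A minimal sketch of the paging loop, assuming the elasticsearch Python client and the same twitter index and date/_id sort as above:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

all_hits = []
search_after = None
while True:
    kwargs = dict(
        index="twitter",
        size=1000,
        query={"match": {"title": "elasticsearch"}},
        sort=[{"date": "asc"}, {"_id": "desc"}],
    )
    if search_after is not None:
        kwargs["search_after"] = search_after
    resp = es.search(**kwargs)
    hits = resp["hits"]["hits"]
    if not hits:
        break
    all_hits.extend(hits)
    # the sort values of the last hit become search_after for the next page
    search_after = hits[-1]["sort"]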
Upvotes: 7
Reputation: 929
Note that from + size cannot be more than the index.max_result_window index setting, which defaults to 10,000.
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-from-size.html
So you'll have TWO approaches here:
1. Add the "track_total_hits": true parameter to your query.
GET index/_search
{
"size":1,
"track_total_hits": true
}
2. Use the Scroll API; note that you can't use from/size in the ordinary way with it.
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-scroll.html
For example:
POST /twitter/_search?scroll=1m
{
"size": 100,
"query": {
"match" : {
"title" : "elasticsearch"
}
}
}
Upvotes: 57
Reputation: 13
For Node.js, starting in ElasticSeach v7.7.0, there is now a scroll helper!
Documentation here: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/7.x/client-helpers.html#_scroll_documents_helper
Otherwise, the main docs for the Scroll API have a good example to work off of: https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/scroll_examples.html
Upvotes: 0
Reputation: 61
here you go:
GET /_search
{
  "size": 10000,
  "query": {
    "match_all": { "boost": 1.0 }
  }
}
But we should mostly avoid this approach for retrieving a huge amount of docs at once, as it can increase data usage and overhead.
Upvotes: 0
Reputation: 2290
When there are more than 10000 results, the only way to get the rest is to split your query into multiple, more refined queries with stricter filters, such that each query returns fewer than 10000 results, and then combine the query results to obtain your complete target result set.
This 10000-result limitation applies to web services that are backed by an Elasticsearch index, and there's just no way around it; the web service would have to be reimplemented without using Elasticsearch.
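A rough sketch of that splitting idea in Python; the timestamp field, date ranges, and index name are hypothetical, and you would tune the partitioning until each slice returns fewer than 10,000 hits:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# hypothetical day-by-day partitions; make them finer if a slice exceeds 10,000 hits
ranges = [
    ("2023-01-01", "2023-01-02"),
    ("2023-01-02", "2023-01-03"),
    # ...
]

all_hits = []
for start, end in ranges:
    resp = es.search(
        index="myIndex",
        size=10000,
        query={
            "bool": {
                "must": [{"term": {"myField": "some_value"}}],
                "filter": [{"range": {"timestamp": {"gte": start, "lt": end}}}],
            }
        },
    )
    all_hits.extend(resp["hits"]["hits"])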
Upvotes: 0
Reputation: 601
I can suggest a better way to do this. I guess you're trying to get more than 10,000 records. Try the approach below and you will get millions of records as well.
Define your client.
client = Elasticsearch(['http://localhost:9200'])
search = Search(using=client)
Check total number of hits.
results = search.execute()
results.hits.total
s = Search(using=client)
Write down your query.
s = s.query(..write your query here...)
Dump the data into a data frame with scan. Scan will dump all the data into your data frame even if it's in billions, so be careful.
results_df = pd.DataFrame((d.to_dict() for d in s.scan()))
Have a look at your data frame.
results_df
If you're getting an error with the Search function, make sure you have the necessary imports:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import pandas as pd
Upvotes: 1
Reputation: 601
Look at the search_after documentation. Combined with a sort, you pass the sort values of the last hit of the previous page as search_after in the next request.
Example query as a hash in Ruby:
query = {
size: query_size,
query: {
multi_match: {
query: "black",
fields: [ "description", "title", "information", "params" ]
}
},
search_after: [after],
sort: [ {id: "asc"} ]
}
Upvotes: 0