ajgustafsson

Reputation: 41

Elasticsearch search query returns different amount of documents

Some background on the elasticsearch instance:

I want to return all documents that have a specific name. The name attribute is mapped as follows:

    "name": {
        "type": "string",
        "index": "not_analyzed"
    }

I have tried different types of search: filter, query_string, term. All give the same result. The current query looks like this:

    {
        "query": {
            "query_string": {
                "default_field": "name",
                "query": "test_run_435_tc"
            }
        },
        "size": 10000000
    }

The problem is that the query does not return the right number of documents on the first try. I know for a fact that there are about 45000 documents with the name "test_run_435_tc" in the index.

But when the query is run for the first time, it returns around 5000 documents. If I repeat the query several times in a row, the number of returned documents increases. After about 3-4 runs, I get the right number of documents in the result.

I am using elasticsearch-py as client.

It seems like elasticsearch is warming up, and after a few runs of the same query it returns the correct number of documents.

Why is elasticsearch behaving like this? Is this normal behaviour for elasticsearch, or am I missing something? Of course I would like to get the correct result on the first try.

Updates based on comments:

The "size" : 10000000 dates from when I did not know how many documents with the same name were in the index.

When setting "size" : 0 and executing the query, this is the response:

 {u'_shards': {u'failed': 0, u'successful': 4, u'total': 4},
  u'hits': {u'hits': [], u'max_score': 0.0, u'total': 28754},
  u'timed_out': True,
  u'took': 130}

When running the same query again with "size" : 0, this is the response:

 {u'_shards': {u'failed': 0, u'successful': 4, u'total': 4},
  u'hits': {u'hits': [], u'max_score': 0.0, u'total': 39223},
  u'timed_out': True,
  u'took': 134}

Running the same query as above with "size": 0, but with these parameters .....?timeout=100000&search_type=count returns this response:

    {
        "took": 525,
        "timed_out": false,
        "_shards": {
            "total": 4,
            "successful": 4,
            "failed": 0
        },
        "hits": {
            "total": 49501,
            "max_score": 0,
            "hits": []
        }
    }

The response above, which reports a hits total of 49501, actually gives the correct number of hits on the first try!

Upvotes: 4

Views: 2860

Answers (1)

Prabin Meitei

Reputation: 2000

One thing is clear from the output: your query is timing out. This can have various causes. I haven't used the Python client, so you need to check whether your client sets a global timeout somewhere, for example when creating the connection.
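To make the timeout visible in code, here is a minimal sketch that inspects a response dict of the shape shown in the question. The helper name `is_partial` is made up for illustration; the only fields it relies on are `timed_out` and `_shards.failed`, both of which appear in your output.

```python
def is_partial(response):
    """Return True if the search response may be missing hits.

    A response is treated as partial when the search timed out or when
    at least one shard reported a failure.
    """
    shards = response.get("_shards", {})
    return bool(response.get("timed_out")) or shards.get("failed", 0) > 0


# First response from the question: it timed out, so the reported total
# is only what the shards had collected before the timeout hit.
first = {
    "_shards": {"failed": 0, "successful": 4, "total": 4},
    "hits": {"hits": [], "max_score": 0.0, "total": 28754},
    "timed_out": True,
    "took": 130,
}

if is_partial(first):
    print("partial result, total so far:", first["hits"]["total"])
```

Checking this flag before trusting `hits.total` would explain why the count grows between runs: each timed-out run returns only what was collected in time.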

First, check how long your original query takes: remove the search_type param but retain the timeout param.

As @moliware suggested, convert your query into a term query for better performance and check the time again.

These two steps will give you an idea of how long your query takes.
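Since the name field is mapped as not_analyzed, a term query matches the exact stored value. A minimal sketch of the request body follows; the helper name `term_query_body` and its default `size` are assumptions for illustration, not part of any client library.

```python
def term_query_body(field, value, size=10):
    """Build a search body that matches `value` exactly on `field`."""
    return {
        "query": {"term": {field: value}},
        "size": size,
    }


# Body equivalent to the query_string search from the question,
# with size 0 so that only the total is of interest.
body = term_query_body("name", "test_run_435_tc", size=0)
print(body)
```

The resulting dict can be passed as the request body of your search call in place of the query_string version, and the `took` value in the response compared between the two.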

You also need to decide whether you need only the count or the documents as well. search_type=count is relatively fast and should be used if you are interested only in the count.

I hope you won't find a use case where you need 100000 documents in one go. You will want to paginate even if you only want to display them.
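If you do need every document, one way to paginate is a scroll cursor. The sketch below assumes a client object with `search` and `scroll` methods that accept the keyword arguments shown, which is how I understand elasticsearch-py exposes them; I have not run it against your cluster, and the function name, page size, and keep-alive value are placeholders.

```python
def iter_all_hits(client, index, body, page_size=1000, keep_alive="2m"):
    """Yield every hit for `body` by walking a scroll cursor page by page."""
    # The initial search opens the scroll and returns the first page.
    page = client.search(index=index, body=body, scroll=keep_alive, size=page_size)
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit
        # Each scroll call returns the next page until no hits are left.
        page = client.scroll(scroll_id=page["_scroll_id"], scroll=keep_alive)
```

Each request then only has to return `page_size` documents, which makes it much less likely to run into the timeout than a single request with a huge size. elasticsearch-py also ships a ready-made `elasticsearch.helpers.scan` helper for the same purpose.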

Lastly, given the size of your documents and the hardware at your disposal, I am surprised that you have only one node with 30gb of RAM. If you are free to use the resources, you should consider running more nodes on the same server. Limiting the heap to less than 32gb is a good idea so that Java can use compressed pointers. But as you have 256gb (humongous) of RAM, you can start more nodes and take advantage of the resources.

With multiple nodes you can retry the queries and check the result.

Upvotes: 0
