Elasticsearch: Search Performance of index with large documents (PDF,doc,txt) is slow

Question

I have indexed 65,000 documents (pdf, docx, txt, etc.) in Elasticsearch using mapper-attachment. Now I want to search the content of those stored documents using the following query:

"from" : 0, "size" : 50,
"query": {
    "match": {
        "my_attachment.content": req.params.name
    }
}

But it takes 20-30 seconds to return results, which is very slow. What can I do to get a quicker response? Any ideas?

Here is the mapping:

"my_attachment": {
    "type": "attachment",
    "fields": {
        "content": {
            "type": "string",
            "store": true,
            "term_vector": "with_positions_offsets"
        }
    }
}

Andrei Stefan · Accepted Answer

Since your machine has 4 CPUs and the index has 5 shards, I'd suggest switching to 4 primary shards, which means you need to reindex. The reason is that a single execution of the query searches every shard, so with 4 cores only 4 shards can be searched at the same time and the fifth one has to wait. To distribute the load evenly at query time, use 4 primary shards (= the number of CPU cores) so that running the query causes less contention at the CPU level.
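As a rough sketch of that suggestion, the settings body for the new index could be derived from the core count. The helper name buildIndexSettings and the zero-replica default (assuming a single-node setup) are my own assumptions for illustration; only the shard count comes from the advice above.

```javascript
// Hypothetical helper: build the settings body for a new index whose
// primary shard count equals the number of CPU cores on the node.
const os = require("os");

function buildIndexSettings(cpuCores = os.cpus().length, replicas = 0) {
  if (!Number.isInteger(cpuCores) || cpuCores < 1) {
    throw new RangeError("cpuCores must be a positive integer");
  }
  return {
    settings: {
      index: {
        number_of_shards: cpuCores,   // one primary shard per core
        number_of_replicas: replicas  // assumption: single node, no replicas
      }
    }
  };
}

// For the 4-CPU machine discussed here:
console.log(JSON.stringify(buildIndexSettings(4), null, 2));
```

The resulting body would be sent when creating the new index, and the documents then reindexed into it, since the number of primary shards is fixed at index creation.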

Also, from the output of curl localhost:9200/your_documents_index/_stats that you provided, I saw that the "fetch" part (retrieving the documents from the shards) is taking 4.2 seconds per operation on average. This is likely the result of having very large documents or of retrieving a lot of documents. size: 50 is not a big number, but combined with large documents it makes the query take longer to return results.
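For reference, the per-operation average can be worked out from the "search" section of the _stats response by dividing the accumulated time by the operation count. A minimal sketch follows; the function name and the sample numbers are made up for illustration (the sample is chosen only so that the fetch average comes out at 4.2 seconds), they are not the asker's actual stats.

```javascript
// Compute average per-operation times from the "search" section of an
// index _stats response.
function averageSearchTimes(stats) {
  const search = stats._all.total.search;
  const avg = (millis, count) => (count > 0 ? millis / count : 0);
  return {
    queryMs: avg(search.query_time_in_millis, search.query_total),
    fetchMs: avg(search.fetch_time_in_millis, search.fetch_total)
  };
}

// Made-up sample, trimmed to the fields used above.
const sampleStats = {
  _all: {
    total: {
      search: {
        query_total: 100,
        query_time_in_millis: 25000,
        fetch_total: 100,
        fetch_time_in_millis: 420000
      }
    }
  }
};

console.log(averageSearchTimes(sampleStats)); // { queryMs: 250, fetchMs: 4200 }
```

A fetch average that dwarfs the query average, as in this sample, points at document size or hit count rather than at the matching itself.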

The content field (the one with the actual document in it) has store: true. If you want this for highlighting, the documentation says:

In order to perform highlighting, the actual content of the field is required. If the field in question is stored (has store set to true in the mapping) it will be used, otherwise, the actual _source will be loaded and the relevant field will be extracted from it.

So if you didn't disable _source for the index, it will be used, and storing the content is not necessary. Also, there is no magic for a faster fetch: it's strictly related to how large your documents are and how many you want to retrieve. Not using store: true might slightly improve the time.
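Since fetch time depends on how much is retrieved, one thing to try is asking for fewer hits and only for highlighted snippets. The sketch below is an assumption-laden example, not a guaranteed fix: the helper name buildSearchBody, the default size of 10, and the "_source": false setting are mine, and it presumes the caller needs snippets and document IDs rather than the full documents.

```javascript
// Hypothetical helper: build a search body that asks for highlighted
// snippets only, so the response does not carry each whole document.
function buildSearchBody(text, size = 10) {
  return {
    from: 0,
    size: size,       // fewer hits means less to fetch
    _source: false,   // assumption: the full document is not needed in the response
    query: {
      match: { "my_attachment.content": text }
    },
    highlight: {
      fields: { "my_attachment.content": {} }
    }
  };
}

console.log(JSON.stringify(buildSearchBody("invoice"), null, 2));
```

In the original request handler this would be called as buildSearchBody(req.params.name). Whether it helps in practice would have to be measured against the same fetch stats as before.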

From the nodes stats (curl -XGET "http://localhost:9200/_nodes/stats") there was no indication that the node has memory or CPU problems, so everything boils down to my previous suggestions.
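If you want to repeat the memory part of that check yourself, a small sketch like the following pulls the JVM heap usage per node out of a nodes stats response. The function name and the sample values are hypothetical, and the sample is trimmed to the fields the function reads.

```javascript
// List JVM heap usage for every node in a _nodes/stats response.
function heapUsageByNode(nodesStats) {
  return Object.values(nodesStats.nodes).map((node) => ({
    name: node.name,
    heapUsedPercent: node.jvm.mem.heap_used_percent
  }));
}

// Made-up sample response with a single node.
const sampleNodesStats = {
  nodes: {
    "abc123": {
      name: "node-1",
      jvm: { mem: { heap_used_percent: 42 } }
    }
  }
};

console.log(heapUsageByNode(sampleNodesStats));
```

A heap percentage that stays high across repeated calls would be a sign of memory pressure; nothing like that showed up here.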
