Reputation: 301
I have a cluster of 10 nodes where I index about 100 million records daily, for a total of close to 6 billion records. I am constantly loading data. Each record has about 75 fields. 99% of my queries filter on the same field, essentially select * from table where groupid = 'value'. Most of these queries return about a hundred records.
My queries currently take about 30 seconds the first two times they run and then drop to milliseconds. The problem is that each user searches for a different groupID, so their queries are going to be slow for the most part until they run them a third time.
Is it possible to "cache" the groupid field so that I can get sub-second queries?
My current query looks like this. (Pseudo-query) (I'm using a non-analyzed field, which I believe is better?)
{
  "query": {
    "filtered": {
      "filter": {
        "term": { "groupID": "valuex" }
      }
    }
  }
}
I've researched this and am not sure how to go about it. I've looked into doc_values = yes and possibly the field cache?
I do not care about scoring or aggregations. My only use case is to filter records and bring back only the 100 or so out of 5 billion that have the correct groupID.
We have about 64 GB of memory on each server.
I'm just looking for help on how to achieve optimal performance/caching, or anything else that would help.
I thought about routing but this would be difficult based on our groupid values.
thanks
Upvotes: 0
Views: 105
Reputation: 14512
Starting with Elasticsearch 2.0, we made some caching changes, like:
I wonder if you are hitting this last one. Note that we did that because the file system cache is probably better than internal caching.
By the way, could you try a bool query instead of a filtered query and see how it performs? The filtered query has been deprecated (and is removed in 5.0).
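As a rough sketch of what that could look like, the snippet below builds the equivalent request body with the term filter moved into the filter clause of a bool query, so no scoring is done. The helper name build_group_filter_query and the size of 100 are only illustrative assumptions; the groupID field and the "valuex" value come from the question. It only prints the JSON body and does not talk to a cluster.

```python
import json


def build_group_filter_query(group_id, size=100):
    """Build a search body that filters on groupID in a non-scoring context.

    The function name and the default size are hypothetical, chosen for
    illustration; adjust them to your own setup.
    """
    return {
        "size": size,  # assumption: roughly 100 hits are expected per groupID
        "query": {
            "bool": {
                # A filter clause matches documents without computing scores.
                "filter": {
                    "term": {"groupID": group_id}
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent to the _search endpoint.
    print(json.dumps(build_group_filter_query("valuex"), indent=2))
```

The printed JSON can be sent as the body of a _search request in place of the filtered query above.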
Upvotes: 1