roldugin

Reputation: 932

Resource usage with rolling indices in Elasticsearch

My question is mostly based on the following article: https://qbox.io/blog/optimizing-elasticsearch-how-many-shards-per-index

The article advises against having multiple shards per node for two reasons: resource consumption (each shard is a full Lucene index with its own overhead) and contention between shards at search time.

The article advocates the use of rolling indices for indices that see many writes and fewer reads.

Questions:

  1. Do the problems of resource consumption by Lucene indices arise if the old indices are left open?
  2. Do the problems of contention arise when searching over a large time range involving many indices and hence many shards?
  3. How does searching many small indices compare to searching one large one?

I should mention that in our particular case, there is only one ES node though of course generally applicable answers will be more useful to SO readers.

Upvotes: 1

Views: 166

Answers (2)

Andrei Stefan

Reputation: 52368

Do the problems of resource consumption by Lucene indices arise if the old indices are left open?

Yes.

Do the problems of contention arise when searching over a large time range involving many indices and hence many shards?

Yes.

How does searching many small indices compare to searching one large one?

When ES searches an index, it picks one copy of each shard (primary or replica) and asks that copy to run the query on its own subset of the data. Searching a shard uses one thread from the node's search threadpool (the threadpool is per node). One thread basically means one CPU core, so if your node has 8 cores, it can search at most 8 shards concurrently at any given time.

Imagine you have 100 shards on that node and your query needs to search all of them. ES initiates the search and all 100 shards compete for the 8 cores, so some shards have to wait some amount of time (microseconds, milliseconds, etc.) to get their share of those 8 cores. Having many shards means fewer documents on each and thus potentially a faster response from each individual shard. But the node that initiated the request must then gather all the shards' responses and aggregate the final result, so the response is only ready when the slowest shard finally returns its set of results.
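The trade-off above can be sketched as a back-of-the-envelope model. This is a hypothetical simplification (real ES scheduling, caching and I/O behave very differently), but it shows why 100 shards on 8 cores queue up in "waves" while a query finishes only when the slowest shard responds:

```python
import math

def response_time_ms(num_shards, cores, per_shard_ms):
    """Rough model: shards run in waves of `cores` concurrent searches
    (one search thread per core), and the query finishes only when the
    last wave, i.e. the slowest shard, has responded."""
    waves = math.ceil(num_shards / cores)
    return waves * per_shard_ms

# 8 shards fit in a single wave on 8 cores:
print(response_time_ms(8, 8, per_shard_ms=50))    # 50
# 100 smaller shards are individually faster (10ms each),
# but need 13 waves, so contention eats the per-shard gain:
print(response_time_ms(100, 8, per_shard_ms=10))  # 130
```

The per-shard timings here are made up for illustration; the point is only the queuing effect, not the absolute numbers.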

On the other hand, if you have a big index with very few shards, there is much less contention for those CPU cores. But since each shard has a lot of work to do on its own, it can take longer to return its individual result.

When choosing the number of shards, many aspects need to be considered. But as a rough guideline, yes, 30GB per shard is a good limit. This won't work for everyone and for every use case, though, and the article fails to mention that. If, for example, your index uses parent/child relationships, 30GB per shard might be too much and the response time of a single shard could be too slow.
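Using that 30GB rule of thumb as a starting point (and only as a starting point; test with your own data, as stressed below), the initial primary shard count is simple arithmetic:

```python
import math

def shards_for_index(expected_size_gb, max_shard_size_gb=30):
    """Rough starting point for primary shard count: enough shards to
    keep each one under the size limit. The 30GB default is the
    guideline quoted in the answer, not a universal rule."""
    return max(1, math.ceil(expected_size_gb / max_shard_size_gb))

print(shards_for_index(200))  # 7  (each shard ends up around 28-29GB)
print(shards_for_index(10))   # 1  (no reason to split a small index)
```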


You took this out of context: "The article advises against having multiple shards per node". No, the article advises one to think about how to structure the indices and shards beforehand. One important step here is testing. Please, test your data before deciding how many shards you need.

You mentioned "rolling indices" in the post, and I assume you mean time-based indices. In this case, one question is the retention period (for how long you need the data). The answer to this question determines how many indices you'll have, and knowing how many indices you'll have gives you the total number of shards.
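That retention-to-shard-count reasoning is plain multiplication. A minimal sketch, assuming daily indices (parameter names are mine, not ES settings):

```python
def total_shards(retention_days, indices_per_day=1,
                 shards_per_index=1, replicas=0):
    """With time-based indices, the retention window determines how many
    indices exist at once, and thus how many shards the cluster holds.
    Each replica adds another full copy of every primary shard."""
    primaries = retention_days * indices_per_day * shards_per_index
    return primaries * (1 + replicas)

# 30 days of daily indices, 2 primary shards each, 1 replica:
print(total_shards(30, shards_per_index=2, replicas=1))  # 120
```

120 shards is a lot for a single node, which is why the retention question matters before you pick a per-index shard count.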

Also, with rolling indices, you need to take care of deleting expired indices. Have a look at Curator for this.
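Curator is the right tool for this in practice; purely to illustrate the idea, here is a sketch of the selection step it automates, assuming a hypothetical `logs-YYYY.MM.DD` naming convention:

```python
from datetime import date, timedelta

def expired_indices(index_names, today, retention_days, prefix="logs-"):
    """Return the time-based indices older than the retention window.
    Assumes names follow the 'logs-YYYY.MM.DD' convention; anything
    that doesn't parse is skipped rather than deleted."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            stamp = date(*map(int, name[len(prefix):].split(".")))
        except ValueError:
            continue  # name doesn't match the date convention
        if stamp < cutoff:
            expired.append(name)
    return expired

names = ["logs-2018.01.01", "logs-2018.01.15", "logs-2018.02.01"]
print(expired_indices(names, date(2018, 2, 1), retention_days=30))
# ['logs-2018.01.01']
```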

Upvotes: 1

Val

Reputation: 217334

It's very difficult to spit out general best practices and guidelines when it comes to cluster sizing as it depends on so many factors. If you ask five ES experts, you'll get ten different answers.

After several years of tinkering and fiddling around with ES, I've found that what works best for me is to always start small (one node, as many indices as your app needs, and one shard per index), load a representative data set (ideally your full data set), and load test to death. Your load testing scenarios should represent the real maximum load you're experiencing (or expecting) in your production environment during peak hours.

Increase the capacity of your cluster (add shards, add nodes, tune knobs, etc.) until your load tests pass, and make sure to increase your capacity by a few more percent to allow for future growth. You don't want your production cluster to be fine now; you want it to be fine a year from now. Of course, that depends on how fast your data will grow, and it's very unlikely you can predict with 100% certainty what will happen a year from now. For that reason, as soon as my load tests pass, if I expect large exponential data growth, I usually increase capacity by another 50%, knowing I will have to revisit my cluster topology within a few months or a year.

So to answer your questions:

  1. Yes, if old indices are left open, they will consume resources.
  2. Yes, the more indices you search, the more resources you will need in order to go through every shard of every index. Be careful with aliases spanning many, many rolling indices (especially on a single node).
  3. This is too broad to answer, as it again depends on the amount of data we're talking about and on what kind of query you're sending, whether it uses aggregations, sorting and/or scripting, etc.

Upvotes: 1
