ElasticSearch Scale Forever

Question

[Image not reproduced]

ElasticSearch Community: Suppose I have a customer named Twetter who has hired me today to build out their search capability for a 181 word social media site.

Assume I cannot predict the number of shards I will need for future scaling and the storage size is already in tens of terabytes.

Assume I do not need to edit any documents once they are indexed. This is strictly for searching.

Referencing the image above, there seem to be some documents that point to 'rolling indexes' ref1 ref2 ref3, whereby I may create a series of indices (e.g. named tweets1 -> N) on the fly. When one index fills up, I can simply add a new machine with a new index, and add it to the same cluster and alias for searching.

Does this architecture hold water in production?

Are there any long-term ramifications to this 'rolling index' architecture, as opposed to predicting a shard count and scaling within that estimate?

Justin Warkentin · Accepted Answer

A shard in Elasticsearch is just a Lucene index. An Elasticsearch index is just a collection of Lucene indices (shards). Given that, capacity planning in your situation comes down to figuring out how many documents you can store in an index with only one shard and still get the query performance you want.

It is the underlying Lucene indices that use up resources. Depending on how your documents are indexed within them, any single node in your cluster can handle only a finite number of shards. You can always scale by adding more nodes to the cluster. Just monitor resource usage and query response times to know when to add more nodes.

It is perfectly reasonable to create indices named tweet_1, tweet_2, tweet_3, etc. rolling forward instead of worrying about resharding your data. It accomplishes the same thing in the end. Just use an index alias to hide the numbers.
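As a rough illustration of hiding the numbered indices behind an alias, here is a minimal Python sketch that only builds the request body for the index aliases API (POST /_aliases). The alias name "tweets" and the helper name alias_actions are assumptions for illustration, and the sketch does not send anything to a cluster.

```python
import json


def alias_actions(alias, indices):
    """Build a body for POST /_aliases that adds every index to one alias.

    `alias` and `indices` are illustrative names, not taken from the answer.
    """
    return {
        "actions": [
            {"add": {"index": name, "alias": alias}} for name in indices
        ]
    }


# Hypothetical rolled-over indices, all searchable through the alias "tweets".
body = alias_actions("tweets", ["tweet_1", "tweet_2", "tweet_3"])
print(json.dumps(body, indent=2))
```

Searches can then target the alias instead of the individual index names, so adding tweet_4 later only means adding one more action.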

Once you figure out how many documents you can store per shard and still get your query performance, decide how many shards per index you want. Multiply those two numbers and cap the index at that number of documents in your code. Once you reach the cap, you just roll over to a new index. Here is what I do in my code to determine which index to send a document to (I have sequential ids):

$index = 'file_' . (int)($fid / $docsPerIndex);

Note that I am using index templates, so a new index is created automatically when the cap is reached, without me having to roll over manually.
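The cap arithmetic and the routing expression above can be sketched together. This is a Python rendering of the same idea as the PHP one-liner, not the answerer's actual code; the figures (50 million documents per shard, 4 shards per index) are made-up placeholders that you would replace with your own measurements.

```python
def docs_per_index(docs_per_shard, shards_per_index):
    """Cap for one index: measured per-shard capacity times shards per index."""
    return docs_per_shard * shards_per_index


def index_for(doc_id, cap, prefix="file_"):
    """Map a sequential, non-negative document id to its index name.

    Integer division matches the truncating cast in the PHP expression
    for non-negative ids.
    """
    if doc_id < 0:
        raise ValueError("doc_id must be non-negative")
    return f"{prefix}{doc_id // cap}"


# Hypothetical numbers: 50M docs per shard, 4 shards per index.
cap = docs_per_index(docs_per_shard=50_000_000, shards_per_index=4)

first = index_for(0, cap)                # still in the first index
last_in_first = index_for(cap - 1, cap)  # last id before rollover
rolled = index_for(cap, cap)             # first id in the next index
print(cap, first, last_in_first, rolled)
```

Because the index name is a pure function of the id, the same function can be used at read time to find the single index that holds a given document.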

One other consideration is what type of queries you will be performing. As the data grows you have two options for scaling.

You need enough nodes in your cluster to parallelize the query, so that it can search across all indices and still respond quickly.

or

You need to name your indices such that you know which to query and only need to query a subset of the indices in the cluster.

Keep in mind that if you have sequential or predictable ids, then Elasticsearch can perform id-based queries efficiently without actually having to query the whole cluster. If you let ES assign ids automatically (assuming you are using ES >= 1.4.0), it will already use predictable ids (flake ids). This also speeds up indexing. Random ids create a worst-case scenario.

If your queries are going to be time-based, then under the id-based rollover scheme above every query has to search the entire set of indices. For time-based queries you want to roll your indices over on a time interval instead (e.g. each day or month, depending on how much data you receive in that time frame) and name them something like tweets_2015_01, tweets_2015_02, etc. That way you can narrow the set of indices you have to search at query time based on the requested time range.
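To make the narrowing step concrete, here is a small Python sketch that turns a requested date range into the list of monthly indices to search. It assumes the tweets_YYYY_MM naming from the examples above; the function name monthly_indices and the sample dates are illustrative.

```python
from datetime import date


def monthly_indices(start, end, prefix="tweets_"):
    """Return the monthly index names covering the range start..end, inclusive."""
    if end < start:
        raise ValueError("end must not be before start")
    names = []
    year, month = start.year, start.month
    # Walk month by month until we pass the month containing `end`.
    while (year, month) <= (end.year, end.month):
        names.append(f"{prefix}{year:04d}_{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


# Hypothetical query range spanning a year boundary.
names = monthly_indices(date(2014, 11, 20), date(2015, 2, 3))
# Elasticsearch accepts a comma-separated list of index names in a search path.
target = ",".join(names)
print(target)
```

A query for that range then touches four indices rather than every index in the cluster.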
