Reputation: 486
I am currently playing around with Elasticsearch (ES). We are ingesting sensor data and for 3 years we have approximately 1,000,000,000 documents in one index, making the index about 50GB in size. Indexing performance is not that important as new data only arrives every 15 minutes per sensor on average, therefore I want to focus on searching and aggregating performance. We are running a front-end showing basically a dashboard about average values from last week compared to one year before etc.
I am using ES on AWS and after performance on one machine was quite slow, I spun up a cluster with 3 data nodes (each 2 cores, 8 GB mem), and gave the index 3 primary shards and one replica. Throwing computing power at the data certainly improved the situation and more power would help more, but my question is:
Would splitting the index for example by month increase the performance? Or being more specific: is querying (esp. by date) a smaller index faster if I adjust the queries adequatly, or does ES already 'know' where to find specific dates in a shard?
(I know about other benefits of having smaller indices, like being able to roll over and keep only a specific time interval, etc.)
Upvotes: 3
Views: 4308
Reputation: 7221
1/ Elasticsearch only knows where to find a specific date in an index if your index is sorted by your date field. You can check the documentation here.
In your use case, it can improve drastically search performance. And since all the data will be added at the "end of the index" since its date sorted, you should not see much of indexation overhead.
2/ Without index sort, smaller time-bounded indices will work better (even if you target all your indices) since it will often allow a rewrite or your range query to a match_all / match_none internal query.
For more information about this behavior you should read this blog post :
Instant Aggregations: Rewriting Queries for Fun and Profit
Upvotes: 3