Reza Sameei

Reputation: 838

ElasticSearch - How does sharding affect indexing performance?

I'm doing some benchmarks on a single-node cluster of ElasticSearch.

I've run into a situation where more shards reduce indexing performance, at least on a single node (in both latency and throughput).

These are some of my numbers:

I got the same results with the bulk API. So I'm wondering what the relationship is and why this happens.

Note: I don't have a resource problem! Resources (CPU and memory) are free.

Upvotes: 10

Views: 5385

Answers (2)

ibexit

Reputation: 3667

Just to get us on the same page:

Your data is organized in indices, each made of shards that are distributed across multiple nodes. When a new document needs to be indexed, a new id is generated and the destination shard is calculated from this id. The write is then delegated to the node holding that destination shard. This distributes your documents pretty evenly across all of your shards.

Finding a document by id is now easy, because the shard containing it can be calculated from the id alone. There is no need to search all shards. BTW, that's the reason why you can't change the number of shards afterwards: a different shard count would result in a different document distribution across your shards.

Now, just to make it clear: each shard is a separate Lucene index, made of segment files located on your disk. When writing, new segments are created. Once a particular number of segment files is reached, the segments are merged. So introducing more shards without distributing them to other nodes just means higher I/O and memory consumption on your single node. While searching, the query is executed against each shard. Afterwards the results from all shards need to be merged into one result: more shards, more CPU work to do...
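To make the routing idea concrete, here is a minimal sketch of "shard = hash(id) modulo number of shards". It is an illustration only: `shard_for` is a hypothetical helper, and `zlib.crc32` is a stand-in hash chosen because it ships with Python, not the hash Elasticsearch actually uses.

```python
import zlib


def shard_for(doc_id: str, num_primary_shards: int) -> int:
    """Map a document id to a shard number (stand-in hash, not Elasticsearch's own)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_primary_shards


# Hypothetical document ids, used only for this illustration.
doc_ids = [f"doc-{i}" for i in range(1000)]

# Where each id lands with 5 shards, and where it would land with 6.
before = {doc_id: shard_for(doc_id, 5) for doc_id in doc_ids}
after = {doc_id: shard_for(doc_id, 6) for doc_id in doc_ids}

# Every id counted here would be looked up on the wrong shard after the change.
moved = sum(1 for doc_id in doc_ids if before[doc_id] != after[doc_id])
print(f"{moved} of {len(doc_ids)} ids would map to a different shard")
```

The same id always maps to the same shard as long as the shard count stays fixed, which is what makes lookups by id cheap. Change the count and many ids point somewhere else.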
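The "query every shard, then merge" step can be sketched roughly like this. It assumes each shard hands back its hits already sorted by descending score; `merge_top_k` and the sample hits are made up for illustration and are not Elasticsearch code.

```python
import heapq
import itertools
from typing import List, Tuple

Hit = Tuple[float, str]  # (score, document id)


def merge_top_k(per_shard_hits: List[List[Hit]], k: int) -> List[Hit]:
    """Merge per-shard hit lists (each sorted by descending score) into a global top-k."""
    # heapq.merge expects ascending input, so sort on the negated score.
    merged = heapq.merge(*per_shard_hits, key=lambda hit: -hit[0])
    return list(itertools.islice(merged, k))


# Hypothetical results from three shards.
shard_a = [(0.9, "a1"), (0.4, "a2")]
shard_b = [(0.8, "b1"), (0.7, "b2")]
shard_c = [(0.5, "c1")]

print(merge_top_k([shard_a, shard_b, shard_c], k=3))
```

Every additional shard adds one more list to collect and merge, so on a single node the extra shards add work without adding hardware.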

Coming back to your question:

For your write heavy indexing case, with just one node, the optimal number of indices and shards is 1!
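As a rough sketch, a single-shard index could be requested with a create-index body like the one built below. `number_of_shards` and `number_of_replicas` are the standard index settings; the index name is hypothetical, and setting replicas to 0 is my assumption for a single-node benchmark (a replica cannot be allocated on the same node as its primary anyway).

```python
import json

INDEX_NAME = "benchmark-index"  # hypothetical name, pick your own

# Request body for creating the index (sent as PUT /<index name>).
create_index_body = {
    "settings": {
        "index": {
            "number_of_shards": 1,    # one primary shard, as suggested above
            "number_of_replicas": 0,  # assumption: no replicas on a single node
        }
    }
}

print(f"PUT /{INDEX_NAME}")
print(json.dumps(create_index_body, indent=2))
```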

But for the search case (not accessing by id), the optimal number of shards per node is the number of CPUs available. That way, searching can be done in multiple threads, resulting in better search performance.

Correction: Searching and indexing are multithreaded, so a single shard can fully utilize all CPU cores of a node.

But what are the benefits of sharding?

  1. Availability: By replicating the shards to other nodes, you can still serve requests if some of your nodes can't be reached anymore!

  2. Performance: Distributing the primary shards to different nodes distributes the workload too.

So if your scenario is write heavy, keep the number of shards per index low. If you need better search performance, increase the number of shards, but keep the "physics" in mind. If you need reliability, take the number of nodes/replicas into account.

Further readings:

https://www.elastic.co/guide/en/elasticsearch/reference/current/_basic_concepts.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html

https://www.elastic.co/de/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

https://thoughts.t37.net/designing-the-perfect-elasticsearch-cluster-the-almost-definitive-guide-e614eabc1a87

Upvotes: 19

user5994461

Reputation: 7068

I've run into a situation where more shards reduce indexing performance, at least on a single node (in both latency and throughput).

For reference: Elasticsearch is a distributed database. Data is stored in an "index", the index is split into "shards". Each "shard" is allocated on a node (a different node if possible).

Having more shards allows the index to use more machines. This is very much how the "distributed" in "distributed database" actually works. Elasticsearch will automatically allocate and move shards in the background, to balance disk usage across all machines.

  • With 1 shard, the data sits on one node; this gives you a baseline of N reads and M writes per second.

  • With 3 shards, the data is split across three nodes; this gives you 3 times the throughput.

Of course this assumes that there are 3 machines available. If there is a single machine, then the machine is doing all the processing either way and having more shards has no effect.

There is a bit of overhead with sharding: queries have to be distributed and the results merged back. Hence doubling the number of shards will not exactly double performance (expect on the order of +90%).

Your cluster has a single machine. You lose performance when you increase the number of shards, because it's just increasing the overhead.
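A toy model of this reasoning, purely for illustration: `estimated_throughput` is a made-up function, and the 5% overhead per extra shard is an assumed number picked so that doubling shards across two nodes lands near the "+90%" ballpark above. It is not a measurement.

```python
def estimated_throughput(baseline: float, shards: int, nodes: int,
                         overhead_per_extra_shard: float = 0.05) -> float:
    """Toy estimate: shards only add capacity while there is a separate node for each."""
    busy_nodes = min(shards, nodes)
    # Assumed coordination cost: each shard beyond the first costs a fixed fraction.
    efficiency = max(0.0, 1.0 - overhead_per_extra_shard * (shards - 1))
    return baseline * busy_nodes * efficiency


baseline = 1000.0  # hypothetical operations per second with 1 shard on 1 node

for shards, nodes in [(1, 1), (2, 2), (5, 1)]:
    estimate = estimated_throughput(baseline, shards, nodes)
    print(f"{shards} shard(s) on {nodes} node(s): ~{estimate:.0f} ops/s")
```

With one node, `busy_nodes` stays at 1 no matter how many shards you add, so only the overhead term changes and the estimate goes down.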

P.S. Shards have a replica by default; the replica takes over if the primary is gone (machine failed). This is how resiliency works. An index with 5 primary shards and one replica of each (10 shard copies in total) can fully utilize 10 nodes, meaning it takes only a few shards to use many, many nodes.

P.P.S. In my experience, a configuration of 5 shards is a maximum. You should never set more than that, unless working with large clusters (10+ machines) or terabyte-sized indices.
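The arithmetic behind that, as a small sketch (`total_shard_copies` is a hypothetical helper name): each primary shard plus its replicas is a separate copy that can live on its own node.

```python
def total_shard_copies(primaries: int, replicas_per_primary: int) -> int:
    """Number of shard copies (primaries plus replicas) that can be spread over nodes."""
    return primaries * (1 + replicas_per_primary)


# 5 primary shards with 1 replica each, as in the example above.
print(total_shard_copies(primaries=5, replicas_per_primary=1))  # prints 10
```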

Upvotes: 1
