Robby
Robby

Reputation: 71

Solandra Sharding: Insider Thoughts

Just got started on Solandra and was trying to understand the 2nd level details of Solandra sharding.

AFAIK Soalndra creates number of shards configured (as "solandra.shards.at.once" property) where each shard is up to size of "solandra.maximum.docs.per.shard".

On the next level it starts creating slots inside each shard which are defined by "solandra.maximum.docs.per.shard"/"solandra.index.id.reserve.size".

What I understood from the datamodel of SchemaInfo CF that inside a particular shard there are slots owned by different physical nodes and these is a race happening between nodes to get these slots.

My questions are:

  1. Does this mean if I request write on a particular solr node eg .....solandra/abc/dataimport?command=full-import does this request gets distributed to all possible nodes etc. Is this distributed write? Because until that happens how would other nodes be competing for slots inside a particular shard.Ideally the code for writing a doc or set of docs would be getting executed on a single physical JVM.

  2. By sharding we tried to write some docs on the single physical node but if it is writing based on the slots which are owned by different physical nodes , what did we actually achieved as we again need to fetch results from different nodes. I understand that the write throughput is maximized.

  3. Can we look into tuning these numbers -? "solandra.maximum.docs.per.shard" , "solandra.index.id.reserve.size","solandra.shards.at.once" .

  4. If I have just one shard and replication factor as 5 in a single DC 6 node setup, I saw that the endpoints of this shard contain 5 endpoints as per the Replication Factor.But what happens to the 6th one. I saw through nodetool that the left 6th node doesn't really get any data. If I increase the replication factor to 6 while keeping the cluster on , will this solve the problem and doing repair etc or is there a better way.

Upvotes: 1

Views: 269

Answers (1)

tjake
tjake

Reputation: 506

Overall the shards.at.once param is used to control parallelism of indexing. the higher that number the more shards are written to at once. If you set it to one you will always to writing to only one shard. Normally this should be set to 20% > the number of nodes in the cluster. so for a four node cluster set it to five.

The higher the reserve size, the less coordination between the nodes is needed. so if you know you have lots of documents to write then raise this.

The higher the docs.per.shard the slower the queries for a given shard will become. In general this should be 1-5M max.

To answer your points:

  1. This will only import from one node. but it will index across many depending on shards at once.

  2. I think the question is should you write across all nodes? Yes.

  3. Yes see above.

  4. If you increase shards.at.once this will be populated quickly

Upvotes: 0

Related Questions