Reputation: 31
After reading some materials about sharding over mongodb cluters, I feel confued about shard keys' function and data migration in the later stage. Assume that we have two shards to store words of a english dictionary. First letter of words is selected sharding key. Assume that words begining with A-C is assigned to shard-A; word begining with D-Z is assigned to shard-B. Obviously,by this way, number of words in Shard-B is much more than that of words in Shard-A. As a result, after a while, some words begining with D-Z will migate from Shard-B to Shard_A for data balance. So, my confusion is that words begining with D-Z occuring in Shard-A is in contradiction with with the rule by sharding key.
Please help me out of the confusion. Thanks in advance.
Upvotes: 3
Views: 971
Reputation: 65333
When you choose a shard key you are determining how a MongoDB cluster can automatically partition data based on observed values.
Your example of sharding on a single letter of the alphabet would be a poor choice because:
You would be limiting your shard key granularity a fixed set of choices (eg. 26 possible choices if the values were uppercase A-Z).
A shard key with low cardinality will indeed lead to ranges that cannot be further subdivided (i.e. your example of a dictionary where there are more words starting with "B" than "X"). In MongoDB terms, shard key ranges are called chunks; those that cannot be further divided will be flagged as jumbo chunks. Jumbo chunks will continue to grow, and the balancer will not attempt to migrate them.
Unless your application use case involves frequently searching by the first letter in the majority of queries, this shard key also wouldn't be effective for targeted queries. Targeted queries are more efficient because mongos
can potentially limit range queries to one or more shards rather than having to send queries to all shards (aka scatter-gather
queries).
Note: you could only choose the "single letter" as a shard key if it was saved as a field present in every document in your sharded collection.
A more typical shard key example would be to use the value of a field that has high cardinality (good uniqueness). In the dictionary example you could perhaps use the dictionary word as the shard key.
Assuming you start by sharding an empty collection, this would conceptually evolve as follows:
The collection starts with a single chunk covering a range with the special values of "MinKey
.. MaxKey
" (aka from minus to plus infinity, or the full range of data).
As documents are added, MongoDB estimates how many documents have been inserted into a given chunk and will automatically split chunks into multiple ranges once there are roughly 64MB of documents in a chunk range.
The chunk ranges will reflect the distribution of the data so in the dictionary example there will be more chunks for ranges of values including B
than there will be for ranges of data including X
. For example, there might be ranges of "bab .. bacon", "baconer .. badger", etc as compared to "waffle .. yak".
Based on migration thresholds, the MongoDB balancer will periodically redistribute chunks between shards as required.
A good shard key will have inherent write distribution that minimises rebalancing effort. You also have to consider how your data arrives, as well. For example, if you are sharding based on words in the english dictionary and inserting the definitions of words in dictionary order then you would end up directing all writes to a single "hot shard" where the current range of values lives. For comparison, if you had a natural distribution of words (eg. as they appear in a newspaper article) the writes would be more distributed.
Assume that words begining with A-C is assigned to shard-A; word begining with D-Z is assigned to shard-A.
By default there is no affinity between shard key ranges and shards. The normal goal is to allow the data to be automatically redistributed as required.
It's possible to set some shard affinity using tag aware sharding, but typically this is done for very specific reasons such as multiple data centre or hot/cold data use cases (see also: Four Ways to Optimize Your Cluster With Tag-Aware Sharding).
Upvotes: 1
Reputation: 7428
The statement that
words begining with A-C occuring in Shard-B is in contradiction with with the rule by sharding key
is not true. Yes, the shared key is always the same, in your case the first letter of the word, however the value of shard key associated with each shard is not hardcoded, and might change during the balancing.
So if initially we had the following picture
+---------+---------+
| Shard 1 | Shard 2 |
+---------+---------+
| A-C | D-Z |
+---------+---------+
and during time, the documents in Shard 2
are becoming majority, the Balancer
will rebalance the data, and correspondingly will reassign the share key values, so you might get another picture:
+---------+---------+
| Shard 1 | Shard 2 |
+---------+---------+
| A-L | M-Z |
+---------+---------+
The point is that you do not query any shard directly and you do not even care how the data is distributed. What you do is query the mongos
router instance, and then all the work is done by MongoDB. Moreover, while querying you might not even be aware of the shard key (although you rather know that to make an efficient query). Instead, MongoDB fetches your query, gets the shard key value (if there is such in your query), finds out in which shard the data might be, and then only queries that particular shard.
So say you are querying the word "Kansas" which initially fell into Shard 2
.
+-----------+
| config db |
+-----------+
↑ |
get shard for | | Shard 2
shard key "K" | |
| ↓
+--------+ query word "Kansas" +--------+ Shard 2 +------------------+
| client | =====================> | mongos | =========> | mongod - Shard 2 |
+--------+ +--------+ +------------------+
So after the balancing you will have another flow
+-----------+
| config db |
+-----------+
↑ |
get shard for | | Shard 1
shard key "K" | |
| ↓
+--------+ query word "Kansas" +--------+ Shard 1 +------------------+
| client | =====================> | mongos | =========> | mongod - Shard 1 |
+--------+ +--------+ +------------------+
But anyway, on your client side you will not notice anything.
Upvotes: 0