mils

Reputation: 1916

mongodb - shard key - compound vs hash

I am working with an existing MongoDB collection. The data looks like the following:

{ user_id: 123, post: { id: 123456789, title: "..." } },
{ user_id: 123, post: { id: 123456790, title: "..." } },
{ user_id: 124, post: { id: 123456791, title: "..." } }

I need to shard this collection, and I'm having trouble selecting a shard key. I often perform operations based on a user (e.g. get all posts from user 123). Should I create a shard key based on

{
  user_id: 1,
  "post.id": 1
}

or the same, but hashed?

If it is hashed, I assume range queries will be broadcast to all shards. But if it is not hashed, will documents be distributed evenly across shards? As you can see, the values increase monotonically.
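For concreteness, a minimal mongo shell sketch of the ranged compound option (the "mydb.posts" namespace is hypothetical; an existing collection also needs an index on the shard key before it can be sharded):

// Index to back the shard key on the existing collection
db.posts.createIndex({ user_id: 1, "post.id": 1 })

// Enable sharding for the database and shard on the ranged compound key
sh.enableSharding("mydb")
sh.shardCollection("mydb.posts", { user_id: 1, "post.id": 1 })

// Queries that include the user_id prefix can be routed to just the
// shard(s) holding that user's chunks rather than broadcast everywhere
db.posts.find({ user_id: 123 })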

Thanks,

EDIT: I think I made a mistake; it appears compound indexes cannot be hashed. From the documentation (https://docs.mongodb.com/manual/core/index-compound):

You may not create compound indexes that have hashed index type. You will receive an error if you attempt to create a compound index that includes a hashed index field.

I guess that means this question is not sensible, so I'll close it.

EDIT 2: On second thought, the question is valid, but it would be better phrased as follows. I appear to have two options:

  1. Hash the post.id field, which should be unique; hashing it should help ensure even distribution of data across shards, or

  2. Create a compound key of user_id and post.id, like the code above. This will also guarantee uniqueness and should help with data locality for a single user. But will it ensure even data distribution across shards?

Thanks

Upvotes: 0

Views: 2402

Answers (1)

Stennie

Reputation: 65323

Hash the post.id field, which should be unique; hashing it should help ensure even distribution of data across shards

If your IDs are monotonic (as per the current examples), I would strongly consider using UUIDs/GUIDs, which can be generated without relying on a central sequence. Unless your sequence numbers are supplied by another system of record, they will introduce a scaling and coordination challenge for distributed clients that need to claim the next available number. GUIDs would more effectively achieve the outcome you are aiming for with hashing.

MongoDB's default ObjectId is one example designed for this purpose: a pseudo-random 12-byte value that can be independently generated in a distributed environment with some approximate ordering based on a leading timestamp.
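A quick shell illustration of this (a sketch only; the variable name is arbitrary):

// Any client can mint an ObjectId locally, with no central sequence
var id = ObjectId()

// The leading 4 bytes are a creation timestamp, which gives
// approximate insertion ordering across independently generated IDs
id.getTimestamp()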

Generating custom UUIDs is outside the scope of MongoDB, but if you have other requirements (length, range of values, ordering, likelihood of collision, ...), there are many algorithms and libraries available for generating UUIDs, or you can create your own formula.

The cardinality of shard key values determines whether you will get effective data distribution. A hashed shard key helps distribute initial writes assuming there is cardinality in the original values: this basically changes a sequence from monotonically increasing to uniform.
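For example, if you sharded on post.id alone, a hashed key would turn the monotonically increasing IDs into writes spread across the whole key space (a sketch only; "mydb.posts" is a hypothetical namespace):

// Hashed shard key: consecutive post IDs hash to unrelated values,
// so new inserts land on different shards instead of one "hot" shard
sh.shardCollection("mydb.posts", { "post.id": "hashed" })

// Equality matches on post.id are still targeted to a single shard;
// range queries on post.id, however, are broadcast to all shards
db.posts.find({ "post.id": 123456789 })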

Create a compound key of user_id and post.id, like the code above. This will also guarantee uniqueness and should help with data locality for a single user. But will it ensure even data distribution across shards?

A shard key requires high cardinality, but does not necessarily have to be unique. For example, if you shard on a single field such as {month: 1} (representing months of the year), there are only 12 possible values for this field. All data for a single month will end up on a single shard, so if there are more documents for month 5 than for month 11, the data distribution will be inherently uneven. MongoDB's data distribution for a sharded collection is based on being able to automatically divide the shard key range into progressively smaller key ranges (called chunks). An underlying assumption is that each chunk represents a roughly equivalent range of data (on average) and that even distribution of chunks across shards will result in balance.
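A sketch of that low-cardinality pitfall (hypothetical namespace):

// With only 12 distinct key values, the collection can never be
// split into more than 12 chunks, regardless of how large it grows
sh.shardCollection("mydb.events", { month: 1 })

// Every document with { month: 5 } falls into the same chunk,
// so a busy month concentrates its data and load on a single shard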

For your use case, {user_id, post.id} seems a likely compound shard key, assuming you address the issue of monotonically increasing IDs. This would appear to meet the aspects discussed above.

However, rather than guessing about shard key outcomes I would suggest testing this in a development environment.

If you have a good understanding or estimate of your data model and distribution patterns, I would suggest sharding in a test environment using representative data. There are many helpful tools for generating fake (but probabilistic) data if needed. For an example recipe using schema analysis and a "more like this" approach, see: duplicate a collection into itself.
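As a rough sketch of that kind of test (the collection name and data shape are assumptions based on the question; the IDs follow the ObjectId suggestion above):

// Load representative fake data
for (var u = 1; u <= 1000; u++) {
  var docs = [];
  for (var p = 0; p < 20; p++) {
    docs.push({ user_id: u, post: { id: ObjectId(), title: "post " + p } });
  }
  db.posts.insertMany(docs);
}

// Per-shard document, data size, and chunk counts for the collection
db.posts.getShardDistribution()

// Cluster-wide view of how chunks are spread across shards
sh.status()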

Upvotes: 1
