Reputation: 1246
I am developing a web application where users will upload a large number of documents to the system, and different types of operations will be performed on those documents, including aggregation. However, the number of documents uploaded by each user varies widely: some might upload a dozen documents, and some might upload a million.
Documents look something like this:
doc{
_id: <self generated UUID>,
uid: <id of user who uploaded the document>,
ctime: <creation timestamp>,
....
<other attributes, etc>
....
}
Now here is the problem in choosing the shard key:
1. If I choose the UUID as the shard key, documents uploaded by the same user are unlikely to end up in the same shard and aggregation operations will be costly.
2. If I use uid as the shard key, then the data will not be distributed evenly across the shards.
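To make the trade-off concrete, here is a rough sketch of the two options. It is only a toy model: `NUM_SHARDS`, `shard_for`, and the workload numbers are made up for illustration, and hashing modulo the shard count is a stand-in, not how MongoDB actually places data.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key: str) -> int:
    """Toy placement rule: hash the key and take it modulo the shard count."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS


def simulate(uploads: dict) -> tuple:
    """Return per-shard document counts when sharding by _id and by uid."""
    by_id, by_uid = Counter(), Counter()
    for uid, count in uploads.items():
        for n in range(count):
            by_id[shard_for(f"{uid}-{n}")] += 1  # stand-in for a UUID _id
        by_uid[shard_for(uid)] += count  # all of a user's documents land together
    return by_id, by_uid


# Hypothetical workload: one heavy uploader and 100 light ones.
uploads = {"heavy_user": 10_000}
uploads.update({f"user{i}": 12 for i in range(100)})

by_id, by_uid = simulate(uploads)
print("sharded by _id:", sorted(by_id.values()))
print("sharded by uid:", sorted(by_uid.values()))
```

In this model, sharding by _id gives roughly equal counts but scatters the heavy user's documents over every shard, while sharding by uid keeps each user together but leaves one shard holding at least the heavy user's 10,000 documents.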
Can anyone suggest a good way to choose a shard key in this situation?
I am very new to partitioning and sharding, and my research on Google and Stack Overflow did not yield anything. I can change the schema of the documents if needed, since the project is still in the design phase.
Upvotes: 1
Views: 1715
Reputation: 1339
You can read more on shard key selection and scaling here:
1] Kristina Chodorow's book "Scaling MongoDB" http://shop.oreilly.com/product/0636920018308.do
2] Antoine Girbal's presentation on Sharding Best Practices http://www.10gen.com/presentations/MongoNYC-2012/Sharding-Best-Practices-Advanced
Upvotes: 1
Reputation: 33155
This is the best guide I've seen on choosing a shard key: http://www.kchodorow.com/blog/2011/01/04/how-to-choose-a-shard-key-the-card-game/
You have to decide how you want to query the data. Perhaps a combination of uid and ctime will yield a good shard key, but I'm not sure if that will cause you grief while querying, as you haven't given much insight on how you plan to query.
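As a rough illustration of why a compound key on uid and ctime might help, here is a simplified model of range-based chunk splitting. It assumes, as MongoDB does, that a chunk can only be split between two different shard key values; the threshold `MAX_CHUNK_DOCS`, the helper `split_into_chunks`, and the sample data are hypothetical, and this is not MongoDB's actual splitting algorithm.

```python
from itertools import groupby

MAX_CHUNK_DOCS = 1000  # hypothetical split threshold


def split_into_chunks(sorted_keys, max_docs):
    """Greedily cut a sorted key list into chunks.

    A cut may only fall between two different key values, so documents
    sharing one key value always stay in the same chunk.
    """
    chunks, current = [], []
    for _, group in groupby(sorted_keys):
        group = list(group)
        if current and len(current) + len(group) > max_docs:
            chunks.append(current)
            current = []
        current.extend(group)
    if current:
        chunks.append(current)
    return chunks


# Hypothetical data: (uid, ctime) pairs for one heavy user and 50 light users.
docs = [("heavy", t) for t in range(5000)]
docs += [(f"user{i:03d}", t) for i in range(50) for t in range(12)]

uid_chunks = split_into_chunks(sorted(uid for uid, _ in docs), MAX_CHUNK_DOCS)
compound_chunks = split_into_chunks(sorted(docs), MAX_CHUNK_DOCS)

print("uid only:    ", [len(c) for c in uid_chunks])
print("(uid, ctime):", [len(c) for c in compound_chunks])
```

With uid alone, the heavy user's documents form a single oversized chunk that cannot be split or balanced. With (uid, ctime), the same documents break into several contiguous chunks that can be moved between shards, while each light user still sits in one chunk. The cost is that a query on a heavy uid may have to visit more than one shard, so whether this helps depends on the query pattern.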
Upvotes: 3