Reputation: 2458
I want to find a proper shard key for a document like this:
{
    "_id": "yahoo.com",
    "c": {
        "en": {
            "bdy": "",
            "cats": [],
            "mDesc": "",
            "mHEq": {},
            "mKeyw": [],
            "mNames": {}
        }
    },
    "cLgth": 566,
    "cType": "text/html",
    "dTime": 1224,
    "jobsDone": [
        "rawdataload",
        "hrefanalyze",
        "metatagsanalyze",
        "keywordanalyze",
        "categoryfinder"
    ],
    "langs": [
        "en", "de"
    ],
    "publishedOn": {
        "sims": 1362752738996
    },
    "tld": "com"
}
My user-facing queries mainly fetch a domain by _id from MongoDB. Some of them also use the language of the domain. The backend queries run different kinds of jobs (tracked in "jobsDone"). Based on this information, different ranges of documents are selected.
So I thought about just using the "_id" which maps to the domain name as it has very high cardinality. Would it make sense to use an MD5 hash of the domain name to distribute it more evenly?
I'm not so sure about "query isolation". As most user queries will just read directly by _id, I think it is fine. The backend job queries could be longer running (scatter/gather) since the user does not see them, but to optimize this I thought about adding the "jobsDone" field to a compound shard key, to distribute the documents by the jobs that have already run.
Is it possible to use an array as a shard key?
Thanks for all the insights!
Upvotes: 0
Views: 192
Reputation: 2699
Shard keys cannot be arrays, since an index on a shard key cannot be a multikey index. I certainly think that you will want "_id" (domain) to be part of your shard key, and if you can find another way to ensure query isolation, then this will help.
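To see which fields of the sample document are ruled out on these grounds, a quick check for array values is enough. This is only a sketch: `array_valued_fields` is a hypothetical helper written for illustration, not part of MongoDB or any driver, and the sample document is trimmed from the question.

```python
# Hypothetical helper (not a MongoDB API): flags candidate shard key
# fields whose value in a sample document is an array.
SAMPLE_DOC = {
    "_id": "yahoo.com",
    "cLgth": 566,
    "jobsDone": ["rawdataload", "hrefanalyze"],
    "langs": ["en", "de"],
    "tld": "com",
}


def array_valued_fields(doc, candidate_fields):
    """Return the candidate fields that hold a list in this document."""
    return [f for f in candidate_fields if isinstance(doc.get(f), list)]


rejected = array_valued_fields(SAMPLE_DOC, ["_id", "jobsDone", "langs", "tld"])
print(rejected)  # ['jobsDone', 'langs']
```

Both "jobsDone" and "langs" are arrays, so neither can be part of the shard key, while "_id" and "tld" are scalar values.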
I'm a little uncertain why you're worried about the domain names distributing evenly, since domain names tend to be pretty random, and if you are expecting to have a very large number of different domains, you should be in good shape. If for some reason domain name distribution becomes a problem, you could run MongoDB 2.4.1 or later and use a hashed shard key (hashed shard keys were introduced in 2.4), so there is no need to compute an MD5 hash yourself.
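As a rough sketch of what that would look like: a hashed shard key on "_id" is requested by giving `"hashed"` as the index type in the `shardCollection` command. The code below only builds the command documents as plain Python dicts; the namespace `crawler.domains` is an assumption, and with a driver such as PyMongo these dicts would be sent to a mongos through `client.admin.command(...)`. The small simulation after it uses MD5 purely to illustrate how hashing spreads similar-looking domain names over buckets. It does not reproduce MongoDB's exact hashed index computation or chunk placement.

```python
import hashlib
from collections import Counter

NAMESPACE = "crawler.domains"  # assumed "database.collection" name


def hashed_shard_commands(namespace):
    """Build the admin commands for sharding a collection on hashed _id."""
    db_name = namespace.split(".", 1)[0]
    return [
        {"enableSharding": db_name},
        {"shardCollection": namespace, "key": {"_id": "hashed"}},
    ]


def bucket_for(domain, buckets):
    """Illustrative only: map a domain to a bucket via the first 8 MD5 bytes."""
    digest = hashlib.md5(domain.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


# Made-up, sequential domain names: a worst case for range-based distribution.
domains = ["site%d.com" % i for i in range(10000)]
counts = Counter(bucket_for(d, 4) for d in domains)

for command in hashed_shard_commands(NAMESPACE):
    print(command)
print(sorted(counts.items()))
```

Note that a hashed key trades range queries on "_id" for even distribution; lookups of a single domain by "_id" are still routed to one shard.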
Upvotes: 2