Reputation: 2489
I am trying to find the size of all chunks in one of my sharded collections.
I'd like to know the real size, not the hint given to mongos as a setting, which I know I can find with:
use config
db.settings.find({_id : "chunksize"})
I have tried several solutions, but the count operation is very slow, so this is not easy. Do you know a solution? (shell, C#, Python, Ruby, bash, I don't care)
For now I have tested the following:
db.getSisterDB("config").chunks.find({ns : "mydb.mycollection"}).forEach(function(chunk) {
db.getSisterDB("mydb").mycollection.find({},{_id : 0, partnerId : 1, id : 1}).min(chunk.min).max(chunk.max).count()
})
but this is too slow. I am under the impression that it does not use the index on my shard key (which is on {partnerId : 1, id : 1}).
I have also replaced count with explain, without any luck. I have also replaced the count with a JavaScript forEach to count manually (trying to get an index-only query that would not hit disk).
I am trying to find the real size because I have seen several chunks that are far above the chunk size given as a hint (2 GB instead of 64 MB).
Upvotes: 8
Views: 3897
Reputation: 2489
After some tries, I found no easier way than using a count in versions < 2.2. The following is the script I use with my compound shard key (partnerId, id).
var collection = "products";
var database = "products";
var ns =database+"."+collection;
rs.slaveOk(true)
db.getSiblingDB("config").chunks.find({ns : ns}).forEach(function(chunk) {
// Chunk bounds on the first shard key field
var pMin = chunk.min.partnerId;
var pMax = chunk.max.partnerId;
// Three ranges: the tail of pMin, the partners strictly in between, the head of pMax
var midR = {partnerId : {$gt : pMin , $lt : pMax}};
var lowR = {partnerId : pMin, id : {$gte : chunk.min.id}};
// Chunk inside a single partnerId: bound id on both sides
if (pMin == pMax) lowR.id = {$gte : chunk.min.id, $lt : chunk.max.id};
var upR = {partnerId : pMax, id : {$lt : chunk.max.id}};
var a = db.getSiblingDB(database).runCommand({count : collection, query : lowR, fields : {partnerId :1, _id : 0}}).n;
var b = db.getSiblingDB(database).runCommand({count : collection, query : midR, fields : {partnerId :1, _id : 0}}).n;
var c = 0;
if (pMin != pMax)
c = db.getSiblingDB(database).runCommand({count : collection, query : upR, fields : {partnerId :1, _id : 0}}).n;
// shard | chunk min | chunk max | low count | mid count | high count | total
print(chunk.shard + "|"+tojson(chunk.min) +"|" +tojson(chunk.max)+"|"+a +"|"+b+"|"+ c +"|"+(a+b+c));
})
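The range-splitting step in the script above can be isolated from the database calls, which makes it easier to check. The following is only a sketch: chunkRangeQueries is a hypothetical helper name, and it assumes the chunk bounds are plain comparable values (numbers or strings), so the MinKey/MaxKey bounds of the first and last chunks are not handled.

```javascript
// Hypothetical helper (not part of the original script): builds the count
// queries for one chunk of a collection sharded on {partnerId: 1, id: 1}.
// Assumes chunk.min / chunk.max hold plain values, not MinKey/MaxKey.
function chunkRangeQueries(chunk) {
  var pMin = chunk.min.partnerId;
  var pMax = chunk.max.partnerId;

  // The chunk lies inside a single partnerId: one query bounded on id.
  if (pMin === pMax) {
    return [
      { partnerId: pMin, id: { $gte: chunk.min.id, $lt: chunk.max.id } }
    ];
  }

  // Otherwise: the tail of pMin, every partner strictly in between,
  // and the head of pMax (the upper bound of a chunk is exclusive).
  return [
    { partnerId: pMin, id: { $gte: chunk.min.id } },
    { partnerId: { $gt: pMin, $lt: pMax } },
    { partnerId: pMax, id: { $lt: chunk.max.id } }
  ];
}
```

Each returned query can then be passed as the query of a count command, and the counts summed to get the number of documents in the chunk.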
Upvotes: 1
Reputation: 2743
I think the command that would help you out the most is the datasize command. There is still a caveat here: the command takes longer to run on larger collections, so your mileage may vary.
Given that, you could try something similar to the following:
var ns = "mydb.mycollection"; // the full namespace of the collection
var key = {partnerId : 1, id : 1}; // the shard key of the collection
db.getSiblingDB("config").chunks.find({ns : ns}).forEach(function(chunk) {
var ds = db.getSiblingDB(ns.split(".")[0]).runCommand({datasize:chunk.ns,keyPattern:key,min:chunk.min,max:chunk.max});
print("Chunk: "+chunk._id +" has a size of "+ds.size+", and includes "+ds.numObjects+" objects (took "+ds.millis+"ms)");
});
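Since the goal in the question is to spot chunks far above the configured chunk size, the sizes reported by the loop above could be collected and filtered afterwards. This is a minimal sketch, assuming each result was stored as an object with an id and a size in bytes (as reported by datasize), and that the chunksize setting is expressed in megabytes; findOversizedChunks and the field names are hypothetical.

```javascript
// Hypothetical helper: keeps only the chunks whose measured size exceeds
// the configured chunk size.
// results: array of {id: <chunk _id>, size: <bytes reported by datasize>}
// chunkSizeMB: value of the "chunksize" setting, assumed to be in megabytes
function findOversizedChunks(results, chunkSizeMB) {
  var bytesPerMB = 1024 * 1024;
  var limitBytes = chunkSizeMB * bytesPerMB;
  return results
    .filter(function (r) { return r.size > limitBytes; })
    .map(function (r) {
      // Report the size in whole megabytes for easier reading.
      return { id: r.id, sizeMB: Math.round(r.size / bytesPerMB) };
    });
}
```

In the shell, the results array could be filled inside the forEach with something like results.push({id: chunk._id, size: ds.size}) before calling the helper with the configured chunk size.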
Upvotes: 9