kamaradclimber

Reputation: 2489

Know real chunk sizes in mongodb

I am trying to find the size of every chunk in one of my sharded collections.

I'd like to know the real size, not the hint given to the mongos as a setting, which I know I can find with:

use config
db.settings.find({_id : "chunksize"})

I have tried several solutions, but the count operation is very slow, so this is not easy. Do you know a solution? (Shell, C#, Python, Ruby, Bash: I don't care.)

So far I have tested the following:

db.getSisterDB("config").chunks.find({ns : "mydb.mycollection"}).forEach(function(chunk) {
     db.getSisterDB("mydb").mycollection.find({},{_id : 0, partnerId : 1, id : 1}).min(chunk.min).max(chunk.max).count()
})

but this is too slow. I am under the impression that it does not use the index on my shard key (which is {partnerId : 1, id : 1}).

I have also replaced count with explain, without any luck, and I have tried replacing the count with a JavaScript forEach that counts manually (hoping for an indexOnly query that would not hit disk).

I want the real size because I have seen several chunks that are far above the chunksize given as a hint (2 GB instead of 64 MB).

Upvotes: 8

Views: 3897

Answers (2)

kamaradclimber

Reputation: 2489

After some tries, I found no easier way than using a count in versions < 2.2. The following is the script I use with my compound shard key (partnerId, id).

var collection = "products";
var database = "products";
var ns = database + "." + collection;
rs.slaveOk(true)
db.getSiblingDB("config").chunks.find({ns : ns}).forEach(function(chunk) {
  var pMin = chunk.min.partnerId
  var pMax = chunk.max.partnerId
  // documents whose partnerId lies strictly between the two chunk bounds
  var midR = {partnerId : {$gt : pMin, $lt : pMax}}
  // documents on the lower bound partnerId, from chunk.min.id upwards
  var lowR = {partnerId : pMin, id : {$gte : chunk.min.id}}
  if (pMin == pMax) lowR.id = {$gte : chunk.min.id, $lt : chunk.max.id}
  // documents on the upper bound partnerId, below chunk.max.id
  var upR = {partnerId : pMax, id : {$lt : chunk.max.id}}
  var a = db.getSiblingDB(database).runCommand({count : collection, query : lowR, fields : {partnerId : 1, _id : 0}}).n
  var b = db.getSiblingDB(database).runCommand({count : collection, query : midR, fields : {partnerId : 1, _id : 0}}).n
  var c = 0
  if (pMin != pMax)
    c = db.getSiblingDB(database).runCommand({count : collection, query : upR, fields : {partnerId : 1, _id : 0}}).n
  // shard | chunk min | chunk max | low count | mid count | upper count | total
  print(chunk.shard + "|" + tojson(chunk.min) + "|" + tojson(chunk.max) + "|" + a + "|" + b + "|" + c + "|" + (a + b + c))
})
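The way the script above splits a chunk's range on the compound key can be isolated as a pure function, which makes the boundary handling easier to check without a running cluster. This is only a sketch: `buildRangeQueries` is a hypothetical name, and it assumes the chunk bounds are plain values comparable with `===` (it does not handle MinKey/MaxKey bounds or ObjectId values).

```javascript
// Hypothetical helper (not part of MongoDB): split one chunk's [min, max)
// range on the compound shard key {partnerId, id} into simple queries.
function buildRangeQueries(chunk) {
  var pMin = chunk.min.partnerId;
  var pMax = chunk.max.partnerId;
  if (pMin === pMax) {
    // The whole chunk sits inside one partnerId: a single bounded range on id.
    return [{ partnerId: pMin, id: { $gte: chunk.min.id, $lt: chunk.max.id } }];
  }
  return [
    // Lower boundary partnerId: ids from chunk.min.id upwards.
    { partnerId: pMin, id: { $gte: chunk.min.id } },
    // Every partnerId strictly between the two boundaries.
    { partnerId: { $gt: pMin, $lt: pMax } },
    // Upper boundary partnerId: ids strictly below chunk.max.id.
    { partnerId: pMax, id: { $lt: chunk.max.id } }
  ];
}
```

Summing the counts of the returned queries gives the same total as a + b + c in the script.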

Upvotes: 1

Andre de Frere

Reputation: 2743

I think the command that would help you the most is dataSize. One caveat: the command takes longer to run on larger collections, so your mileage may vary.

Given that, you could try something similar to the following:

var ns = "mydb.mycollection" // the full namespace of the collection
var key = {partnerId : 1, id : 1} // the shard key of the collection

db.getSiblingDB("config").chunks.find({ns : ns}).forEach(function(chunk) {
    // ask the server for the size in bytes and the document count of this chunk's key range
    var ds = db.getSiblingDB(ns.split(".")[0]).runCommand({dataSize : chunk.ns, keyPattern : key, min : chunk.min, max : chunk.max});
    print("Chunk: " + chunk._id + " has a size of " + ds.size + ", and includes " + ds.numObjects + " objects (took " + ds.millis + "ms)")
})
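Since the goal is to spot chunks far above the configured chunk size, the per-chunk results could be collected into an array and filtered afterwards. The following is a minimal sketch under stated assumptions: `findOversizedChunks` and the `{id, size}` result shape are hypothetical, `size` is taken to be in bytes as reported by the command above, and the chunksize setting is taken to be in megabytes.

```javascript
// Hypothetical helper: given [{id, size}] entries (size in bytes) and the
// configured chunk size in MB, return the oversized chunks, largest first.
function findOversizedChunks(results, chunkSizeMB) {
  var bytesPerMB = 1024 * 1024;
  var limitBytes = chunkSizeMB * bytesPerMB;
  return results
    .filter(function (r) { return r.size > limitBytes; })
    .sort(function (a, b) { return b.size - a.size; })
    .map(function (r) {
      return { id: r.id, sizeMB: Math.round(r.size / bytesPerMB) };
    });
}
```

In the shell loop, each iteration would push `{id: chunk._id, size: ds.size}` into the array before calling the helper with the value read from config.settings.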

Upvotes: 9
