Reputation: 3943
I have done a simple experiment to test MongoDB's performance and disk usage. I inserted 22GB of data, but it occupies 50GB on disk. I will describe the experiment in detail below.
In this experiment I wished to insert about 40GB of data (120 bytes of data for each insertion) into MongoDB, which I believe is simple enough. However, I stopped when the actual inserted data reached 22GB because I noticed a storage overhead issue: the data I inserted amounts to about 22GB, but the indexdb.* files add up to 50GB. So there is more than 100% storage overhead.
I have read quite a bit of MongoDB's docs. According to what I have read, there might be two kinds of storage overhead.
So from my calculation, whatever amount of data I insert, the overhead should be at most 9GB. But here the overhead is 50GB - 22GB = 28GB, and I have no clue what is inside those 28GB. If this overhead is always more than 100%, it is quite a lot.
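To pin down where that space goes, db.stats() gives a database-level breakdown that can be compared against the file size (a sketch from the mongo shell; field names as reported by MongoDB 2.0):

db.stats().dataSize      // BSON size of the stored documents themselves
db.stats().indexSize     // total size of all indexes (here, just the _id index)
db.stats().storageSize   // space allocated to collections, including padding and free space
db.stats().fileSize      // total size of the indexdb.* data files on disk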
Can anyone please explain this to me?
Here are some MongoDB stats I obtained from the mongo shell.
db.serverStatus():
{
    "host" : "mongodb-VirtualBox",
    "version" : "2.0.2",
    "process" : "mongod",
    "uptime" : 531693,
    "uptimeEstimate" : 460787,
    "localTime" : ISODate("2012-01-26T16:32:12.888Z"),
    "globalLock" : {
        "totalTime" : 531692893756,
        "lockTime" : 454374529354,
        "ratio" : 0.8545807827977436,
        "currentQueue" : {
            "total" : 0,
            "readers" : 0,
            "writers" : 0
        },
        "activeClients" : {
            "total" : 0,
            "readers" : 0,
            "writers" : 0
        }
    },
    "mem" : {
        "bits" : 64,
        "resident" : 292,
        "virtual" : 98427,
        "supported" : true,
        "mapped" : 49081,
        "mappedWithJournal" : 98162
    },
    "connections" : {
        "current" : 3,
        "available" : 816
    },
    "extra_info" : {
        "note" : "fields vary by platform",
        "heap_usage_bytes" : 545216,
        "page_faults" : 14477174
    },
    "indexCounters" : {
        "btree" : {
            "accesses" : 3808733,
            "hits" : 3808733,
            "misses" : 0,
            "resets" : 0,
            "missRatio" : 0
        }
    },
    "backgroundFlushing" : {
        "flushes" : 8861,
        "total_ms" : 26121675,
        "average_ms" : 2947.93759169394,
        "last_ms" : 119,
        "last_finished" : ISODate("2012-01-26T16:32:03.825Z")
    },
    "cursors" : {
        "totalOpen" : 0,
        "clientCursors_size" : 0,
        "timedOut" : 0
    },
    "network" : {
        "bytesIn" : 44318669115,
        "bytesOut" : 50995599,
        "numRequests" : 201846471
    },
    "opcounters" : {
        "insert" : 0,
        "query" : 3,
        "update" : 201294849,
        "delete" : 0,
        "getmore" : 0,
        "command" : 551619
    },
    "asserts" : {
        "regular" : 0,
        "warning" : 0,
        "msg" : 0,
        "user" : 1,
        "rollovers" : 0
    },
    "writeBacksQueued" : false,
    "dur" : {
        "commits" : 28,
        "journaledMB" : 0,
        "writeToDataFilesMB" : 0,
        "compression" : 0,
        "commitsInWriteLock" : 0,
        "earlyCommits" : 0,
        "timeMs" : {
            "dt" : 3062,
            "prepLogBuffer" : 0,
            "writeToJournal" : 0,
            "writeToDataFiles" : 0,
            "remapPrivateView" : 0
        }
    },
    "ok" : 1
}
db.index.dataSize(): 29791637704
db.index.storageSize(): 33859297120
db.index.totalSize(): 45272200048
db.index.totalIndexSize(): 11412902928
db.runCommand("getCmdLineOpts"): { "argv" : [ "./mongod" ], "parsed" : { }, "ok" : 1 }
My code fragment. I have removed the MongoDB connection code and kept only the core logic here.
static void fillupDb()
{
    for (double i = 0; i < 1024 * 1024 * 1024 / 3; i++)
    {
        // Convert the counter i into a 20-byte array used as the KEY
        byte[] prekey = BitConverter.GetBytes(i);
        byte[] key = new byte[20];
        prekey.CopyTo(key, 0);

        // Generate a random 100-byte VALUE
        byte[] value = getRandomBytes(100);
        put(key, value);
    }
}

public void put(byte[] key, byte[] value)
{
    BsonDocument pair = new BsonDocument {
        { "_id", key },    // I am using _id as the index
        { "value", value }
    };
    // Save() upserts by _id, which is why serverStatus above reports these operations as updates
    collection.Save(pair);
}
Upvotes: 1
Views: 2675
Reputation: 18615
Well, first of all: how do you measure the size of your input data? A key-value pair can be two plain strings or a JSON object, and what actually ends up on disk is a full BSON document that also stores the field names and type information.
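For example, the stored size of a single pair can be checked directly from the mongo shell (a sketch against the collection above): a document with a 20-byte binary _id and a 100-byte binary value also carries the field names, type bytes and length prefixes, which comes to roughly 147 bytes of BSON rather than 120.

Object.bsonsize(db.index.findOne())   // full BSON size, in bytes, of one stored key-value pair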
Additionally, every document has some padding added to it so that it can grow through subsequent updates. The average padding factor can be retrieved through db.col.stats().paddingFactor.
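For the collection above that would be something like (a sketch; field names as in MongoDB 2.0):

db.index.stats().paddingFactor   // average padding factor applied to the documents
db.index.stats().avgObjSize      // average stored document size, in bytes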
Finally, there's more than just the oplog that may add to your overhead. There's always an index on _id, which in your case (since your documents are so small) will introduce significant overhead in terms of disk space usage. Unless you disabled it (--nojournal), the journal will add quite a few bytes to the total as well.
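Two quick checks for those last two items (a sketch; the journal files themselves live on disk under <dbpath>/journal):

db.index.totalIndexSize() / (1024 * 1024 * 1024)   // _id index size in GB -- about 10.6 GB from the numbers above
db.serverStatus().dur                              // the "dur" section is present only while journaling is enabled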
Hope that helps.
Upvotes: 4