Reputation: 3943
I have done a simple experiment to test MongoDB's performance and disk usage. I inserted 22GB of data, but it occupies 50GB on disk. I will describe the experiment in detail below.
In this experiment I wished to insert about 40GB of data (120 bytes of data for each insertion) into MongoDB, which I believe is simple enough. However, I stopped when the actual inserted data reached 22GB because I noticed a storage overhead issue: the data I inserted amounts to about 22GB, but the indexdb.* files add up to 50GB. So there is more than 100% storage overhead.
I have read quite a bit of MongoDB's docs. According to what I have read, there might be two kinds of storage overhead.
So from my calculation, whatever amount of data I insert, the overhead should be at most 9GB. But here the overhead is 50GB - 22GB = 28GB, and I have no clue what is inside those 28GB. If this overhead is always more than 100%, it is quite a lot.
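To pin down where that space goes, db.stats() gives a database-level breakdown that can be compared against the file size (a sketch from the mongo shell; field names as reported by MongoDB 2.0):

db.stats().dataSize      // BSON size of the stored documents themselves
db.stats().indexSize     // total size of all indexes (here, just the _id index)
db.stats().storageSize   // space allocated to collections, including padding and free space
db.stats().fileSize      // total size of the indexdb.* data files on disk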
Can anyone please explain this to me?
Here are some MongoDB stats I obtained from the mongo shell.
db.serverStatus():
{
    "host" : "mongodb-VirtualBox",
    "version" : "2.0.2",
    "process" : "mongod",
    "uptime" : 531693,
    "uptimeEstimate" : 460787,
    "localTime" : ISODate("2012-01-26T16:32:12.888Z"),
    "globalLock" : {
        "totalTime" : 531692893756,
        "lockTime" : 454374529354,
        "ratio" : 0.8545807827977436,
        "currentQueue" : {
            "total" : 0,
            "readers" : 0,
            "writers" : 0
        },
        "activeClients" : {
            "total" : 0,
            "readers" : 0,
            "writers" : 0
        }
    },
    "mem" : {
        "bits" : 64,
        "resident" : 292,
        "virtual" : 98427,
        "supported" : true,
        "mapped" : 49081,
        "mappedWithJournal" : 98162
    },
    "connections" : {
        "current" : 3,
        "available" : 816
    },
    "extra_info" : {
        "note" : "fields vary by platform",
        "heap_usage_bytes" : 545216,
        "page_faults" : 14477174
    },
    "indexCounters" : {
        "btree" : {
            "accesses" : 3808733,
            "hits" : 3808733,
            "misses" : 0,
            "resets" : 0,
            "missRatio" : 0
        }
    },
    "backgroundFlushing" : {
        "flushes" : 8861,
        "total_ms" : 26121675,
        "average_ms" : 2947.93759169394,
        "last_ms" : 119,
        "last_finished" : ISODate("2012-01-26T16:32:03.825Z")
    },
    "cursors" : {
        "totalOpen" : 0,
        "clientCursors_size" : 0,
        "timedOut" : 0
    },
    "network" : {
        "bytesIn" : 44318669115,
        "bytesOut" : 50995599,
        "numRequests" : 201846471
    },
    "opcounters" : {
        "insert" : 0,
        "query" : 3,
        "update" : 201294849,
        "delete" : 0,
        "getmore" : 0,
        "command" : 551619
    },
    "asserts" : {
        "regular" : 0,
        "warning" : 0,
        "msg" : 0,
        "user" : 1,
        "rollovers" : 0
    },
    "writeBacksQueued" : false,
    "dur" : {
        "commits" : 28,
        "journaledMB" : 0,
        "writeToDataFilesMB" : 0,
        "compression" : 0,
        "commitsInWriteLock" : 0,
        "earlyCommits" : 0,
        "timeMs" : {
            "dt" : 3062,
            "prepLogBuffer" : 0,
            "writeToJournal" : 0,
            "writeToDataFiles" : 0,
            "remapPrivateView" : 0
        }
    },
    "ok" : 1
}
db.index.dataSize(): 29791637704
db.index.storageSize(): 33859297120
db.index.totalSize(): 45272200048
db.index.totalIndexSize(): 11412902928
db.runCommand("getCmdLineOpts"): { "argv" : [ "./mongod" ], "parsed" : { }, "ok" : 1 }
My code fragment. I have removed the MongoDB connection code and kept only the core logic here.
static void fillupDb()
{
    for (double i = 0; i < 1024 * 1024 * 1024 / 3; i++)
    {
        // Convert the counter i into a 20-byte array used as the KEY
        byte[] prekey = BitConverter.GetBytes(i);
        byte[] key = new byte[20];
        prekey.CopyTo(key, 0);

        // Generate a random 100-byte VALUE
        byte[] value = getRandomBytes(100);
        put(key, value);
    }
}

public void put(byte[] key, byte[] value)
{
    BsonDocument pair = new BsonDocument {
        { "_id", key },    // I am using _id as the index
        { "value", value }
    };
    // Save() upserts by _id, which is why serverStatus above reports these operations as updates
    collection.Save(pair);
}
Upvotes: 1
Views: 2675
Reputation: 18615
Well, first of all: how do you measure the size of your input data? A key-value pair can be two plain strings or a JSON object, and what actually ends up on disk is a full BSON document that also stores the field names and type information.
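For example, the stored size of a single pair can be checked directly from the mongo shell (a sketch against the collection above): a document with a 20-byte binary _id and a 100-byte binary value also carries the field names, type bytes and length prefixes, which comes to roughly 147 bytes of BSON rather than 120.

Object.bsonsize(db.index.findOne())   // full BSON size, in bytes, of one stored key-value pair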
Additionally, every document has some padding added to it so that it can grow through subsequent updates. The average padding factor can be retrieved through db.col.stats().paddingFactor.
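For the collection above that would be something like (a sketch; field names as in MongoDB 2.0):

db.index.stats().paddingFactor   // average padding factor applied to the documents
db.index.stats().avgObjSize      // average stored document size, in bytes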
Finally, there's more than just the oplog that may add to your overhead. There's always an index on _id, which in your case (since your documents are so small) will introduce significant overhead in terms of disk space usage. Unless you disabled it (--nojournal), the journal will add quite a few bytes to the total as well.
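Two quick checks for those last two items (a sketch; the journal files themselves live on disk under <dbpath>/journal):

db.index.totalIndexSize() / (1024 * 1024 * 1024)   // _id index size in GB -- about 10.6 GB from the numbers above
db.serverStatus().dur                              // the "dur" section is present only while journaling is enabled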
Hope that helps.
Upvotes: 4