Reputation: 1615
UPDATED
We have a growing MongoDB database whose load consists mostly of inserts. It is a two-shard database with three collections, currently on MongoDB 2.6.6. Each shard is a replica set with two data-bearing nodes and one arbiter.
Analyzing disk usage with db.stats() gives these numbers:
shard0:
    dataSize: 95 GB
    storageSize: 99 GB
    fileSize: 107 GB
shard1:
    dataSize: 109 GB
    storageSize: 112 GB
    fileSize: 121 GB
Partitioning is done by a shard key based on a date. Effectively, shard0 is filled with new data while shard1 remains stable in data usage. Occasionally we move the shard key boundary to a newer date, and data migrates from shard0 to shard1.
The padding factor on all three collections is 1, which should make new allocations efficient: each inserted document should occupy only as much space as the document itself. However, there is a certain amount of "wasted" space that seems quite large for a database that should be fairly compact.
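For reference, the effective allocation settings can be verified per collection from the mongo shell; this is only a sketch, and "events" is a placeholder collection name:

// Check allocation settings on one collection ("events" is a placeholder name).
use logdata
var s = db.events.stats()
print("paddingFactor: " + s.paddingFactor)   // 1 means records are allocated at the exact document size
print("userFlags: " + s.userFlags)           // 1 means usePowerOf2Sizes allocation is enabled
print("overhead (storageSize - dataSize): " + (s.storageSize - s.dataSize) + " bytes")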
Here are the numbers on three consecutive days:
Shard | Data Size | Storage Size | File Size
-----------------------------------------------
shard0 | 90 GB | 93 GB | 101 GB
shard0 | 92 GB | 95 GB | 103 GB
shard0 | 94 GB | 97 GB | 105 GB
The file size reported by MongoDB is about 11 GB larger than the data size (roughly 12%).
According to this link, part of that space could be attributed to preallocated data files. Three collections at 2 GB each would consume at most 6 GB. Record deletions are extremely rare and could account for wasted space only in the kilobyte range. What about the oplog and journal: are they included in any of these size figures or not?
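For reference, the oplog is a capped collection in the local database, so it can be inspected separately; a minimal sketch, run against each shard's primary rather than through mongos:

// The oplog lives in the "local" database and is reported separately from logdata.
use local
db.printReplicationInfo()               // configured oplog size and current time window
db.oplog.rs.stats(1024 * 1024 * 1024)   // oplog sizes scaled to GB
// Journal files live on disk under <dbpath>/journal and are not part of db.stats() output.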
What are we missing? How is the remaining 5 GB (11 GB - 6 GB) actually being used? Can it be compacted?
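For context, the commands in question would be something like the following sketch (the collection name is a placeholder):

// compact defragments a collection's extents and rebuilds its indexes, but on MMAPv1
// it does not shrink the data files themselves; it also blocks the database while it runs,
// so on a replica set it is normally run on secondaries first.
db.runCommand({ compact: "events" })

// repairDatabase rewrites all data files and can actually reduce fileSize,
// but it needs free disk space roughly equal to the current data set plus 2 GB.
db.repairDatabase()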
Here are the results of the db.stats(1024*1024*1024) command:
{
    "raw" : {
        "rs0/l0.example.com:27018,l1.example.com:27018" : {
            "db" : "logdata",
            "collections" : 5,
            "objects" : 30222965,
            "avgObjSize" : 3409.2183424094887,
            "dataSize" : 95,
            "storageSize" : 99,
            "numExtents" : 106,
            "indexes" : 10,
            "indexSize" : 6,
            "fileSize" : 107,
            "nsSizeMB" : 16,
            "dataFileVersion" : {
                "major" : 4,
                "minor" : 5
            },
            "extentFreeList" : {
                "num" : 0,
                "totalSize" : 0
            },
            "ok" : 1
        },
        "rs1/l2.example.com:27018,l3.example.com:27018" : {
            "db" : "logdata",
            "collections" : 4,
            "objects" : 22676428,
            "avgObjSize" : 5185.006179632877,
            "dataSize" : 109,
            "storageSize" : 112,
            "numExtents" : 99,
            "indexes" : 8,
            "indexSize" : 6,
            "fileSize" : 121,
            "nsSizeMB" : 16,
            "dataFileVersion" : {
                "major" : 4,
                "minor" : 5
            },
            "extentFreeList" : {
                "num" : 0,
                "totalSize" : 0
            },
            "ok" : 1
        }
    },
    "objects" : 52899393,
    "avgObjSize" : 4170.319437597327,
    "dataSize" : 204,
    "storageSize" : 211,
    "numExtents" : 205,
    "indexes" : 18,
    "indexSize" : 12,
    "fileSize" : 228,
    "extentFreeList" : {
        "num" : 0,
        "totalSize" : 0
    },
    "ok" : 1
}
Upvotes: 1
Views: 2267
Reputation: 934
You could try MongoDB's new WiredTiger storage engine. For me it reduced disk space usage by 75%.
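Note that WiredTiger requires MongoDB 3.0 or later, and the existing MMAPv1 data files have to be rebuilt (e.g. via an initial sync of each member, or a dump/restore); a rough sketch:

// Check which storage engine a mongod is running (the field only exists in 3.0+):
db.serverStatus().storageEngine    // e.g. { "name" : "wiredTiger" } or { "name" : "mmapv1" }

// A member started on WiredTiger would use something like:
//   mongod --storageEngine wiredTiger --dbpath /data/db-wt --replSet rs0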
Upvotes: 1
Reputation: 1623
Well, the data set is going to grow as you feed it, but at your size I would at the very least shard each collection onto its own mongod instance, possibly even its own machine. While this will not directly reduce the size (it may make it slightly larger), the distribution will give you insight into each of the three collections' individual growth rates, and you should see better throughput (assuming you do not use a single storage volume for all servers).
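To see which collection is actually growing before splitting anything up, something like this per-collection report can be run against each shard's primary; a minimal sketch:

// Print per-collection data and storage sizes (in GB) to compare growth rates.
use logdata
db.getCollectionNames().forEach(function (name) {
    var s = db.getCollection(name).stats(1024 * 1024 * 1024);
    print(name + ": dataSize=" + s.dataSize + " GB, storageSize=" + s.storageSize + " GB");
});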
Upvotes: 0
Reputation: 222969
Most probably you are missing the fact that MongoDB also preallocates storage for future use:
The total size in bytes of the data files that hold the database. This value includes preallocated space and the padding factor. The value of fileSize only reflects the size of the data files for the database and not the namespace file.
You can read more about each of the numbers here.
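Given the numbers in the question, a rough way to see how much of fileSize is preallocated or unused rather than allocated to collections and indexes (run against each shard's primary, not mongos):

// Everything in the data files that is not collection or index extents is free or
// preallocated space (extentFreeList is 0 here, so it is mostly preallocation).
var s = db.stats(1024 * 1024 * 1024);   // sizes scaled to GB
print("unallocated space in data files: " + (s.fileSize - s.storageSize - s.indexSize) + " GB");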
Upvotes: 0