Reputation: 17107
I imported a CSV into MongoDB using Compass. The CSV file has a size of 755 MB, but in Mongo the collection shows a size of 2.8 GB. Why is this? Also, the CSV has a lot of sparsely populated fields. In Mongo, these fields are set to empty strings for most of the rows (documents). Is there an option to create the field for a particular document only if the value is not missing?
Upvotes: 0
Views: 1039
Reputation: 7621
Loading a sparse file can unnecessarily eat up a lot of space. Consider 14,493,120 rows of this line:
foo,bar,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,zip
The file size is 797,229,120 bytes. So let's go ahead and load it with mongoimport with NO --ignoreBlanks. On a MacBook, this takes 8m45s and produces an avg doc size of 490 bytes, for a total of 7,101,710,784 uncompressed bytes. The WiredTiger storage engine achieves 6.8x compression on this, yielding an on-disk representation of only 1,044,369,232 bytes, plus an _id index of 145,854,464 bytes. Call it 1200MB total. OK, somewhat bigger than the 797MB input flatfile.
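The quoted figures hang together; a quick back-of-envelope check using only the numbers stated above:

```python
rows = 14_493_120                  # row count from the test file
avg_doc = 490                      # avg uncompressed doc size, bytes
uncompressed = 7_101_710_784       # total uncompressed bytes reported
on_disk = 1_044_369_232            # WiredTiger-compressed collection bytes
id_index = 145_854_464             # _id index bytes

# 490 bytes/doc times the row count lands on the reported total ...
assert abs(rows * avg_doc - uncompressed) / uncompressed < 0.001
# ... and the compression ratio works out to the quoted 6.8x.
assert round(uncompressed / on_disk, 1) == 6.8

# Collection plus index is roughly the "1200MB total" above.
print((on_disk + id_index) / 1e6)
```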
Load it with --ignoreBlanks and the landscape changes. It takes only 5m55s to load and produces an avg doc size of 63 bytes -- about 7.7x smaller. The total uncompressed size is about 913,000,000 bytes (63 bytes x 14,493,120 docs), also about 7.7x smaller. The compression ratio drops to 3.2x, but that still yields an on-disk representation of 286,487,020 bytes. Not surprisingly, the _id index is the same size (145MB), and 286MB + 145MB ~= 432MB. Compared to the 797MB of raw CSV, the point should be clear: loading sparse files into MongoDB with --ignoreBlanks yields a significantly smaller footprint; in this case, close to 2x smaller on disk, including indexes, than the raw CSV file.
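What --ignoreBlanks does per row can be mimicked in a few lines of Python. The field names and the exact blank count here are illustrative, not taken from the actual import:

```python
import csv
import io

# A row shaped like the example above: two populated leading values,
# a long run of blank columns, one populated trailing value.
raw = "foo,bar" + "," * 45 + "zip\n"

# With no --ignoreBlanks, every column becomes a field in the document,
# most of them empty strings.
row = next(csv.reader(io.StringIO(raw)))
with_blanks = {f"field{i}": v for i, v in enumerate(row)}

# --ignoreBlanks instead skips columns whose value is empty, so the
# document carries only the populated fields.
without_blanks = {k: v for k, v in with_blanks.items() if v != ""}

print(len(with_blanks), len(without_blanks))
```

Every field name stored in a document costs bytes in BSON, so dropping the empty ones is where the ~7.7x per-document saving comes from.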
Upvotes: 3