om-nom-nom

Reputation: 62835

Proper way to store large files (but not media) in mongo?

I know that in the Mongo world, big data such as images, music and video goes into GridFS, while small, structured data goes directly into Mongo.

Recently, I've exceeded the limit on BSONObj size. My files, which are actually objects like vector<vector<vector<Foo>>>, look like small regular data (with a nested structure), but with a huge size (starting from 20 MB). I'm not sure, but writing them to GridFS after a prior conversion to a byte array seems like a bad idea (the vectors have dynamic, non-constant lengths). Is there some workaround?
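
For reference, the conversion I'm thinking of would have to length-prefix every level, roughly like this (a rough sketch, assuming Foo is trivially copyable; the names are made up):

#include <cstdint>
#include <vector>

struct Foo { double a; int b; };   // placeholder; assuming Foo is trivially copyable

// Append a 32-bit length prefix as raw bytes.
static void appendUInt32(std::vector<char>& out, uint32_t v) {
    const char* p = reinterpret_cast<const char*>(&v);
    out.insert(out.end(), p, p + sizeof(v));
}

// Flatten vector<vector<vector<Foo>>> into one byte array for GridFS.
// Every level is length-prefixed, so the dynamic sizes survive a round trip.
std::vector<char> serialize(const std::vector<std::vector<std::vector<Foo> > >& data) {
    std::vector<char> out;
    appendUInt32(out, static_cast<uint32_t>(data.size()));
    for (std::size_t i = 0; i < data.size(); ++i) {
        appendUInt32(out, static_cast<uint32_t>(data[i].size()));
        for (std::size_t j = 0; j < data[i].size(); ++j) {
            const std::vector<Foo>& leaf = data[i][j];
            appendUInt32(out, static_cast<uint32_t>(leaf.size()));
            if (!leaf.empty()) {
                const char* raw = reinterpret_cast<const char*>(&leaf[0]);
                out.insert(out.end(), raw, raw + leaf.size() * sizeof(Foo));
            }
        }
    }
    return out;
}

This works, but it turns the whole object into an opaque blob that I can't query.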

Optionally, I want to perform queries on those objects, e.g. fetch the first slice (index) of the top-level vector.

Upvotes: 0

Views: 326

Answers (1)

mnemosyn

Reputation: 46291

Depending on the queries you want to support, two alternatives come to mind. However, I believe that for data sizes of a couple of hundred MB, these will be slower than doing everything in RAM and using MongoDB as a mere blob store:

1) You could put each dimension in a separate object:

FirstLevel {
   "_id" : ObjectId("..."),
   "Children" : [ ObjectId("..."), ... ]
   // list of vector ids (of the second level)
}

Probably not a very good solution. It still imposes limitations on the number of items you can store, but that number should be pretty large, because it's roughly (16 MB / id size)^3 (with 12-byte ObjectIds, that is on the order of a million children per node, before per-element BSON overhead), and maybe much smaller (in the leaves) if Foo is a large object.

Accesses will be pretty slow, because you'll have to walk the tree. Nodes and leaves have somewhat different data formats. Very extensible, however (any dimensionality).
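
For illustration, the lower levels could look something like this (the names are placeholders):

SecondLevel {
   "_id" : ObjectId("..."),
   "Children" : [ ObjectId("..."), ... ]
   // list of leaf ids (of the third level)
}

Leaf {
   "_id" : ObjectId("..."),
   "Items" : [ { /* Serialized Foo */ }, ... ]
}

Fetching the first slice of the top-level vector then means reading FirstLevel, following Children[0], and loading that subtree.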

2) Since your data is 3-dimensional, you could store it 'truly 3-dimensional':

Data {
  Coords : { "x" : 121, "y" : 991, "z" : 12 },
  ActualData : { /* Serialized Foo */ }
}

Using a compound index on the {x, y, z} tuple, this supports dimensional slices very well, except for operations like "select all z = 13, then order by x". This approach comes with quite a bit of overhead, and you'll probably need a custom (de)serializer. I don't know the C++ driver, but in C# that is very easy to implement.

This will also support jagged arrays quite well.

2a) If you don't want the overhead of 2), you can squeeze the coordinates into a single long. This would be similar to geohashing, which is what MongoDB does for its geospatial indexes.

Querying slices of coordinates is then a bit mask operation, which is unfortunately not yet supported for queries ($bit only works for updates). You can vote for it, though.
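
A minimal sketch of such a packing in C++ (assuming each coordinate fits into 21 bits; the function names are made up):

#include <cstdint>

// Pack three 21-bit coordinates into one 64-bit value (x | y | z).
// Plain concatenation is shown here; interleaving the bits (Morton order)
// would behave more like a geohash.
int64_t packCoords(uint32_t x, uint32_t y, uint32_t z) {
    const uint64_t mask = (1ULL << 21) - 1;   // 21 bits per coordinate
    return static_cast<int64_t>(((static_cast<uint64_t>(x) & mask) << 42) |
                                ((static_cast<uint64_t>(y) & mask) << 21) |
                                 (static_cast<uint64_t>(z) & mask));
}

void unpackCoords(int64_t packed, uint32_t& x, uint32_t& y, uint32_t& z) {
    const uint64_t mask = (1ULL << 21) - 1;
    const uint64_t p = static_cast<uint64_t>(packed);
    x = static_cast<uint32_t>((p >> 42) & mask);
    y = static_cast<uint32_t>((p >> 21) & mask);
    z = static_cast<uint32_t>(p & mask);
}

One nice side effect of putting x in the most significant bits: a fixed x maps to a contiguous range of packed values, so a $gte/$lte range query can stand in for the bit-mask query, at least for the leading dimension.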

Maybe you could abuse geohashing for your purposes as well, but that'd be rather experimental.

Upvotes: 1
