DeathNote

Reputation: 264

MongoDB workaround for documents above the 16 MB size limit?

The MongoDB collection I am working on takes sensor data from a cellphone, and it is pinged to the server roughly every 2-6 seconds.

The data is huge and the 16 MB limit is crossed after 4-5 hours; there doesn't seem to be any workaround for this.

I have tried searching for it on Stack Overflow and went through various questions, but no one actually shared their workaround.

Is there any way... maybe on the DB side, to distribute the data into chunks, like it is done for big files via GridFS?

Upvotes: 14

Views: 18319

Answers (2)

pieperu

Reputation: 2762

To fix this problem you will need to make some small amendments to your data structure. By the sounds of it, for your documents to exceed the 16 MB limit, you must be embedding your sensor data into an array in a single document.

I would not suggest using GridFS here; I do not believe it to be the best solution, and here is why.

There is a technique known as bucketing that you could employ which will essentially split your sensor readings out into separate documents, solving this problem for you.

The way it works is this:

Let's say I have a document with some embedded readings for a particular sensor that looks like this:

{
    _id : ObjectId("xxx"),
    sensor : "SensorName1",
    readings : [
        { date : ISODate("..."), reading : "xxx" },
        { date : ISODate("..."), reading : "xxx" },
        { date : ISODate("..."), reading : "xxx" }
    ]
}

With the structure above, there is already a major flaw: the readings array grows without bound and will eventually exceed the 16 MB document limit.

So what we can do is change the structure slightly to look like this, to include a count property:

{
    _id : ObjectId("xxx"),
    sensor : "SensorName1",
    readings : [
        { date : ISODate("..."), reading : "xxx" },
        { date : ISODate("..."), reading : "xxx" },
        { date : ISODate("..."), reading : "xxx" }
    ],
    count : 3
}

The idea behind this is: whenever you $push a reading into the embedded array, you also increment ($inc) the count field. When you perform this update (push) operation, you include a filter on this "count" property, which might look something like this:

{ count : { $lt : 500} }

Then set your update options so that "upsert" is set to true:

db.sensorReadings.update(
    { sensor: "SensorName1", count: { $lt : 500 } },
    {
        //Your update. $push your reading and $inc your count
        $push: { readings: ReadingDocumentToPush },
        $inc: { count: 1 }
    },
    { upsert: true }
)
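
As a side note, a minimal sketch of a supporting index (this index is my assumption, not part of the original code): the upsert has to look up the current bucket by sensor and count on every write, so a compound index on those two fields should keep the writes cheap as the collection grows.

// Assumed supporting index so the { sensor, count } filter does not scan the whole collection
db.sensorReadings.createIndex({ sensor: 1, count: 1 })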

See here for more info on MongoDB update and the upsert option:

MongoDB update documentation

What will happen is, when the filter condition is not met (i.e. when there is either no existing document for this sensor, or the count is greater than or equal to 500, because you are incrementing it every time an item is pushed), a new document will be created, and the reading will be embedded in this new document. So you will never hit the 16 MB limit if you do this properly.

Now, when querying the database for readings of a particular sensor, you may get back multiple documents for that sensor (instead of just one with all the readings in it). For example, if you have 10,000 readings, you will get 20 documents back, each with 500 readings.

You can then use the aggregation pipeline and $unwind to filter your readings as if they were their own individual documents.
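
As a rough sketch of such a query (the sensor name and the date range are only illustrative), using the sensorReadings collection from above:

db.sensorReadings.aggregate([
    // Only look at the bucket documents for the sensor we care about
    { $match: { sensor: "SensorName1" } },

    // Turn each embedded reading into its own document
    { $unwind: "$readings" },

    // Filter the individual readings, e.g. by date (illustrative range)
    { $match: { "readings.date": { $gte: ISODate("2016-01-01T00:00:00Z") } } },

    // Return them in chronological order
    { $sort: { "readings.date": 1 } }
])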

For more information on $unwind see here; it's very useful:

MongoDB Unwind

I hope this helps.

Upvotes: 35

PaulShovan

Reputation: 2134

You can handle this type of situation using GridFS in MongoDB.

Instead of storing a file in a single document, GridFS divides the file into parts, or chunks, and stores each chunk as a separate document. By default, GridFS uses a chunk size of 255 kB; that is, GridFS divides a file into chunks of 255 kB with the exception of the last chunk. The last chunk is only as large as necessary. Similarly, files that are no larger than the chunk size only have a final chunk, using only as much space as needed plus some additional metadata.
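
For illustration (assuming the default fs bucket, so the collection names below are the defaults rather than anything specific to your setup), the split is visible directly in the two collections GridFS writes to:

// File metadata: one document per stored file
db.fs.files.findOne()

// File content: one document per ~255 kB chunk, linked to the file by files_id and ordered by n
// (the ObjectId is a placeholder for a real files_id; data is excluded to hide the binary payload)
db.fs.chunks.find({ files_id: ObjectId("...") }, { data: 0 }).sort({ n: 1 })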

The documentation of GridFS contains almost everything you need to implement GridFS. You can follow it.

As your data is a stream, you can try something like the following...

gs.write(data, callback)

where data is a Buffer or a string, and callback gets two parameters: an error object (if an error occurred) and a result value indicating whether the write was successful. While the GridStore is not closed, every write is appended to the opened GridStore.
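
Note that GridStore belongs to older versions of the Node.js driver; newer versions expose the same functionality through GridFSBucket instead. A minimal sketch of streaming sensor data that way (the connection string, database, bucket and file names are placeholders, not anything from your setup):

const { MongoClient, GridFSBucket } = require('mongodb');

async function storeSensorData(data) {
    // Placeholder connection string and database name
    const client = await MongoClient.connect('mongodb://localhost:27017');
    const db = client.db('sensors');

    // GridFS splits whatever is written into ~255 kB chunk documents
    const bucket = new GridFSBucket(db, { bucketName: 'sensorData' });
    const uploadStream = bucket.openUploadStream('SensorName1-readings');

    // data can be a Buffer or a string, just like with gs.write
    uploadStream.end(data, () => client.close());
}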

You can follow this GitHub page for streaming-related information.

Upvotes: 0
