Dima

Reputation: 4128

MongoDB: partition large dataset yourself?

I'm looking for a clean, cheap way to get rid of old data and release the disk space back to the OS without pain.

I store sampling data (a timestamp plus a bunch of properties). Lots of it. Each sample is a single document, and the collection gets huge.

Capped collections are out of the question because I need to retain data based on a time range, not on the size it takes. TTL collections are no good because of the space required by the TTL index, which may grow ridiculously large. Sharding is out for other reasons.

So what I thought of doing is partitioning the whole thing myself. I'd store partitions of data (for example, weekly batches) separately. Each week I'd simply start a new 'partition' and drop some old ones. Brutal and simple. I'm removing a large amount of indexed data, hence dropping whole partitions instead of removing documents.

The question is what I should use as a 'partition': collections or databases? Technically I could go either way. The app is Java based, and I could easily manage a bunch of collections or databases.

My concern with dropping collections is that MongoDB used to have a problem with releasing disk space back to the OS. When it tries to reuse that space, there could be fragmentation issues, a need to run repair(), and so on.

Would dropping the database be a more efficient way?

Again, I need the least disruptive way to get rid of terabytes of old data while continuing to pump new data in. If you have experience with either approach, please share.

Upvotes: 0

Views: 192

Answers (1)

Sammaye

Reputation: 43884

Each week I'd simply start a new 'partition'.

One common solution is to create a collection per week, named something like recordings_wk53, and then drop the oldest collection each week.
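As a rough sketch of the naming and retention side of this, assuming Java (since the app is Java based): the class name WeeklyPartitions, the year-qualified name format, and the keepWeeks parameter below are all made up for illustration. The year is included as an assumption so that week numbers do not collide from one year to the next, and the zero padding makes the names sort chronologically. The actual drop is left to the caller; with the official Java driver that would typically be something like database.getCollection(name).drop().

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper: computes weekly collection names and decides which
// existing ones are old enough to drop. It does not talk to MongoDB itself.
final class WeeklyPartitions {
    static final String PREFIX = "recordings_";

    // Name of the weekly collection a given date falls into, using ISO weeks,
    // e.g. recordings_2020_wk53. Zero padding keeps names in date order.
    static String nameFor(LocalDate date) {
        int year = date.get(IsoFields.WEEK_BASED_YEAR);
        int week = date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
        return String.format("%s%04d_wk%02d", PREFIX, year, week);
    }

    // Returns the names in 'existing' that are older than the last
    // 'keepWeeks' weeks (the current week counts as one of them).
    // Names without the prefix are ignored.
    static List<String> expired(Collection<String> existing, LocalDate today, int keepWeeks) {
        String oldestKept = nameFor(today.minusWeeks(keepWeeks - 1));
        List<String> result = new ArrayList<>();
        for (String name : existing) {
            if (name.startsWith(PREFIX) && name.compareTo(oldestKept) < 0) {
                result.add(name);
            }
        }
        return result;
    }
}
```

A weekly job could pass the current collection names (for example from database.listCollectionNames()) to expired(...) and drop each name it returns, while writers call nameFor(...) on each sample's timestamp to pick the target collection.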

Collections or a Database?

Collections will be easier to manage within your application and might be faster to drop since there are fewer files to delete, BUT dropping a collection WILL NOT free up disk space to the OS.

You could do this with databases relatively easily: create a connection per week within your application. As long as you're only managing hundreds of them it should be fine, and since you're not using them as a means to scale vertically, the ops patterns etc. should be quite good for the use case.

Would dropping the database be a more efficient way?

Hmm, this is a very subjective and opinionated question, but I would probably go for collections. MongoDB can then reuse the freed space immediately without having to reallocate it. That is why MongoDB does not release space back to the OS: so it doesn't have to retake it, which can be slow.

Upvotes: 1
