Reputation: 21005
Related to "Ways to implement data versioning in MongoDB" and "Structure of documents for versioning of a time series on MongoDB".
What data structure should I adopt for versioning when I also need to be able to handle queries?
Suppose I have 8500 documents of the form
{
  _id: '12345-11',
  noFTEs: 5
}
Each month I receive details of a change to noFTEs in about 30 of these documents. I want to store the new data along with the previous value(s), together with a date.
That would seem to result in:
{
  _id: '12345-11',
  noFTEs: {
    '2015-10-28T00:00:00+01:00': 5,
    '2015-1-8T00:00:00+01:00': 3
  }
}
But I also want to be able to run searches on the most recent data (e.g. noFTEs > 4, where this element should be considered as 5, not 3). At that stage all I know is that I want to use the most recent data; I will not know the key. So an alternative would be an array:
{
  _id: '12345-11',
  noFTEs: [
    {date: '2015-10-28T00:00:00+01:00', val: 5},
    {date: '2015-1-8T00:00:00+01:00', val: 3}
  ]
}
Another alternative - as suggested by @thomasbormans in the comments below - would be
{
  _id: '12345-11',
  versions: [
    {noFTEs: 5, lastModified: '2015-10-28T00:00:00+01:00', other data...},
    {noFTEs: 3, lastModified: '2015-1-8T00:00:00+01:00', other...}
  ]
}
I'd really appreciate some insights into the considerations I need to make before jumping all the way in; I fear I will end up with a query that puts a pretty high workload on Mongo. (In practice there are 3 other fields that can be combined for searching, and one of these is also likely to change over time.)
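For what it's worth, if I keep the array ordered with the newest entry first, I assume a query on the latest value could address position 0 directly; something like this (the collection name is just for illustration):

// assumes the newest entry is always kept at position 0 of the array
db.companies.find({ 'noFTEs.0.val': { $gt: 4 } })

// or, with the versions layout:
db.companies.find({ 'versions.0.noFTEs': { $gt: 4 } })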
Upvotes: 1
Views: 2828
Reputation: 3637
When you model a noSQL database, there are some things you need to keep in mind.
First of all, consider the size of each document. If you use arrays in your documents, make sure they cannot grow past the 16 MB size limit per document.
Second, you must model your database so that things are easy to retrieve. Some "denormalization" is acceptable in favor of speed and ease of use for your application.
So if you need to know the current noFTEs value, and you need to keep a history only for audit purposes, you could go with 2 collections:
collection["current"] = [
{
_id: '12345-11',
noFTEs: 5,
lastModified: '2015-10-28T00:00:00+01:00'
}
]
collection["history"] = [
{ _id: ...an object id...
source_id: '12345-11',
noFTEs: 5,
lastModified: '2015-10-28T00:00:00+01:00'
},
{
_id: ...an object id...
source_id: '12345-11',
noFTEs: 3,
lastModified: '2015-1-8T00:00:00+01:00'
}
]
By doing it this way, you keep your most frequently accessed records smaller (I suppose the current version is accessed more often). This makes Mongo more likely to keep the "current" collection in the memory cache, and the documents will be retrieved faster from disk because they are smaller.
This design seems best to me in terms of memory optimisation, but the decision depends directly on how you will use your data.
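A rough sketch of the monthly update flow under this design (mongo shell; the collection names and the new value are just placeholders):

// 1) copy the current version into the history collection
var doc = db.current.findOne({ _id: '12345-11' });
db.history.insert({
  source_id: doc._id,
  noFTEs: doc.noFTEs,
  lastModified: doc.lastModified
});

// 2) overwrite the current version with the new value
db.current.update(
  { _id: '12345-11' },
  { $set: { noFTEs: 6, lastModified: '2015-11-28T00:00:00+01:00' } }
);

// queries on the latest data stay simple and can use an ordinary index
db.current.ensureIndex({ noFTEs: 1 });
db.current.find({ noFTEs: { $gt: 4 } });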
EDIT: I changed my original response in order to create separated inserts for each history entry. In my original answer, I tried to keep your history entries close to your original solution to focus on denormalization topic. However, keeping history in an array is a poor design decision and I decided to make this answer more complete.
The reasons to keep separate documents in the history collection instead of a growing array are many:
1) Whenever you change the size of a document (for example, by inserting more data into it), Mongo may need to move the document to an empty part of the disk in order to accommodate the larger document. This way, you end up creating storage gaps that make your collection larger on disk.
2) Whenever you insert a new document, Mongo tries to predict how big it can become based on previous inserts/updates. If your history documents are similar in size, the padding factor will be close to optimal. However, when you maintain growing arrays, this prediction won't be good and Mongo will waste space on padding.
3) In the future, you will probably want to shrink your history collection if it grows too large. Usually, a history retention policy is defined (for example: 5 years), and data older than that is backed up and pruned. If you have kept a separate document for each history entry, this operation is much easier (see the sketch after this list).
I can find other reasons, but I believe those 3 are enough to make the point.
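For instance, pruning old entries becomes a single query; a sketch, assuming lastModified is stored as a real date and a 5-year cutoff:

// remove history entries older than the retention cutoff
db.history.remove({ lastModified: { $lt: ISODate('2010-10-28T00:00:00Z') } });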
Upvotes: 3
Reputation: 69663
To add versioning without compromising usability and speed of access for the most recent data, consider creating two collections: one with the most recent documents and one to archive the old versions of the documents when they get changed.
You can use currentVersionCollection.findAndModify to update a document while also receiving the previous (or new, depending on parameters) version of that document in one command. You then just need to remove the _id of the returned document, add a timestamp and/or revision number (if you don't have these already), and insert it into the archive collection.
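A minimal sketch of that flow in the mongo shell (collection names and the new value are only placeholders):

// update the current document and get back the version as it was before the update
var previous = db.currentVersionCollection.findAndModify({
  query:  { _id: '12345-11' },
  update: { $set: { noFTEs: 6, lastModified: new Date() } },
  new:    false   // return the pre-update document
});

// turn the returned document into an archive entry
delete previous._id;               // the archive entry gets its own ObjectId
previous.source_id = '12345-11';
previous.archivedAt = new Date();
db.archiveCollection.insert(previous);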
By storing each old version in its own document you also avoid document growth, and you prevent documents from hitting the 16 MB document limit when they are changed a lot.
Upvotes: 1