Reputation: 1503
What are the best practices, tradeoffs, and relative effectiveness of the two options below for maintaining data consistency in MongoDB?
For example, let's say there are comments and users. With option 1, each comment would contain:
{
user_id:
user_displayname:
user_gravatar:
[comment fields]
}
If the user decided to change his or her displayname, the user object would change, but a script would also run the required MongoDB commands to update all the user's comments to reflect the change.
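To make the cost of option 1 concrete, here is a minimal sketch of that update path. It uses plain Python dicts and lists as stand-ins for the users and comments collections so it runs without a database; the function name rename_user and the sample data are hypothetical. With PyMongo, the loop would likely collapse into a single update_many call filtered on user_id with a $set on user_displayname.

```python
def rename_user(users, comments, user_id, new_name):
    """Update the user record and every comment that embeds the display name.

    `users` maps user_id -> user document; `comments` is a list of comment
    documents. Returns the number of comment documents that were rewritten.
    """
    users[user_id]["displayname"] = new_name
    modified = 0
    for comment in comments:
        if comment["user_id"] == user_id:
            comment["user_displayname"] = new_name
            modified += 1
    return modified


# Hypothetical sample data standing in for the two collections.
users = {1: {"displayname": "alice"}, 2: {"displayname": "bob"}}
comments = [
    {"user_id": 1, "user_displayname": "alice", "text": "first"},
    {"user_id": 2, "user_displayname": "bob", "text": "second"},
    {"user_id": 1, "user_displayname": "alice", "text": "third"},
]
modified = rename_user(users, comments, 1, "alice_b")
```

The number of documents touched grows with the number of comments the user has written, which is the write cost this option trades for single-query reads.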
With option 2, each comment would contain:
{
user_id:
[comment fields]
}
If the user decided to change his or her displayname, it would only be changed in the user object itself. When a comment is accessed without hitting the cache, the application would associate the user object with the comment object in the cache. That way, if this comment is accessed again while it is still in the cache, both the user and comment queries are skipped. (Am I basically describing the built-in MongoDB cache?)
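The read path in option 2 is essentially a cache-aside pattern. A rough sketch follows, with dicts standing in for the two collections and for the cache (which could be Redis or an in-process map). The class name CommentReader, the queries counter, and the invalidate_user method are assumptions added for illustration; the question does not say how stale entries would be handled after a name change.

```python
class CommentReader:
    """Cache-aside reader: joins a comment with its user once, then serves it from the cache."""

    def __init__(self, users, comments):
        self.users = users        # stand-in for the users collection
        self.comments = comments  # stand-in for the comments collection
        self.cache = {}           # stand-in for Redis or an in-process cache
        self.queries = 0          # counts simulated database lookups

    def get(self, comment_id):
        # Cache hit: neither the comment nor the user is queried.
        if comment_id in self.cache:
            return self.cache[comment_id]
        # Cache miss: one lookup for the comment, one for its user.
        comment = self.comments[comment_id]
        user = self.users[comment["user_id"]]
        self.queries += 2
        joined = {**comment, "user_displayname": user["displayname"]}
        self.cache[comment_id] = joined
        return joined

    def invalidate_user(self, user_id):
        """Drop cached comments for a user, e.g. after a display name change."""
        stale = [cid for cid, c in self.cache.items() if c["user_id"] == user_id]
        for cid in stale:
            del self.cache[cid]


reader = CommentReader(
    users={1: {"displayname": "alice"}},
    comments={10: {"user_id": 1, "text": "hello"}},
)
first = reader.get(10)   # miss: two lookups
second = reader.get(10)  # hit: no lookups
```

Without some form of invalidation or expiry, a cached comment would keep showing the old displayname until it is evicted.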
Is it worth doing the data redundancy described in option 1 at all? Or is MongoDB smart enough that additional but equivalent queries are already cached? Or is it worth using something else, such as Redis, to build a cache layer myself?
Thanks!
Upvotes: 1
Views: 1294
Reputation: 836
If you are talking about a caching mechanism for hundreds of GB of data, you are talking about a serious tradeoff. For anything less than 5 GB of data, the tradeoffs do not matter. Between 5 GB and 100 GB, there is a grey area.
The worst case scenario for your data is this:
200 GB of data. 4,000 reads per second. A user with 9,000 comments changes his / her name. Your application also indexes comments on this name value. Your application must then update 9,000 comments and 9,000 index keys. This will cause serious drag in your application for a while.
Then, we must also pose the question for something as simple as names on comments: "Do you have to update the names on old comments?"
When you follow a new person on Twitter, your past timeline does not inherit the person's past tweets; only your new timeline is affected. The same goes for comments: why should you update the person's name on past comments?
So, I would add a #3 to your list: "Do not update users' names"
Upvotes: 1
Reputation: 3760
There is no "cache" in MongoDB itself. MongoDB uses memory-mapped files, and its performance depends very much on whether it can keep the most frequently used documents, your application's "working set", mapped in main memory rather than having to page each document in from disk prior to accessing it.
You are describing a denormalized database design, where each document contains attributes that would not be there in a normalized form. This can make sense, and it is in fact a very common technique with MongoDB, if it allows you to fetch all the data you need in a single operation, rather than having to do multiple queries.
The downside, as you point out, is that it requires more expensive updates, since you need to update all the documents into which a particular attribute has been denormalized. A second downside is that larger documents may make it more difficult to keep the working set in memory.
The answer therefore depends on your data access patterns. Generally, if your application is read-heavy, and it tends to need all of these denormalized attributes together, then the denormalizing approach is a good choice. If the application is write-heavy, and especially if it makes frequent updates to those particular attributes, then denormalization is not a good choice.
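As a back-of-the-envelope illustration of that rule of thumb, the sketch below counts database operations per second under each design. It is a deliberately crude model: it assumes one operation per document read or written and ignores caching, indexes, and document size. The function name ops_per_second and the rename rates are assumptions; the 4,000 reads per second and 9,000 comments per user are borrowed from the worst-case example in the other answer.

```python
def ops_per_second(reads, renames, comments_per_user, denormalized):
    """Rough count of database operations per second under each design.

    Assumes one operation per document read or written and ignores caching,
    indexes, and document size.
    """
    if denormalized:
        # One query per comment read; a rename rewrites the user plus each comment.
        return reads * 1 + renames * (1 + comments_per_user)
    # Two queries per comment read (comment + user); a rename touches one document.
    return reads * 2 + renames * 1


# Hypothetical workloads: rates are per second.
read_heavy = {"reads": 4000, "renames": 0.01, "comments_per_user": 9000}
write_heavy = {"reads": 100, "renames": 5, "comments_per_user": 9000}

read_heavy_denorm = ops_per_second(**read_heavy, denormalized=True)
read_heavy_norm = ops_per_second(**read_heavy, denormalized=False)
write_heavy_denorm = ops_per_second(**write_heavy, denormalized=True)
write_heavy_norm = ops_per_second(**write_heavy, denormalized=False)
```

Under these assumed numbers, the denormalized design does less work for the read-heavy workload and far more for the write-heavy one, which matches the guidance above.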
Upvotes: 1