ernirulez
ernirulez

Reputation: 751

Mongodb: about performance and schema design

After learning about performance and schema design in MongoDB, I still can´t figure out how would I make the schema design in an application when performance is a must.

Let´s imagine if we have to make YouTube to work with MongoDB as its database. How would you make the schema?

OPTION 1: two collections (videos collection and comments collection)

Pros: adding, deleting and editing comments affects only the comments collection, therefore these operations would be more efficient.

Cons: Retrieving videos and comments would be 2 different queries to the database, one for videos and one for comments.

OPTION 2: single collection (videos collection with the comments embedded)

Pros: You retrieve videos and its comments with a single query.

Cons: Adding, deleting and editing comments affect the video Document, therefore these operations would be less efficient.

So what do you think? Are my guesses true?

Upvotes: 2

Views: 244

Answers (3)

Alex
Alex

Reputation: 21766

As usual the answer is, it depends. As as a rule of thumb you should favour embedding, unless you need to regularly query the embedded objects on its own or if the embedded array is likely to get too large(>~100 records). Using this guideline, there are a few questions you need to ask regarding your application.

How is your application going to access the data ? Are you only ever going to show the comments on the same page as the associated video ? Or do you want to provide the options to show all comments for a given user across all movies ? The first scenario favours embedding (one collection), whereas you probably would be better of with two collections in the second scenario.

Secondly, how many comments do you expect for each video ? Taking the analogy of IMDB, you could easily expect more than 100 comments for a popular video, so that means you are better off creating two separate collections as the embedded array of comments would grow large quite quickly. I wouldn't be too concerned about the overhead of an application join, they are generally comparable in speed compared to a server-side join in a relational database provided your collections are properly indexed.

Finally, how often are users going to update their comments after their initial post ? If you lock the comments after 5 minutes like on StackOverflow users may not update their comments very often. In that case the overhead of updating or deleting comments in the video collection will be negligible and may even be outweigh the cost of performing a second query in a separate comments collection.

Upvotes: 1

Markus W Mahlberg
Markus W Mahlberg

Reputation: 20722

As a caller in the desert, I have to say that embedding should only be used under very special circumstances:

  1. The relation is a "One-To(-Very)-Few" and it is absolutely sure that no document will ever exceed this limit. A good example would be the relation between "users" and "email addresses" – a user is unlikely to have millions of them and there isn't even a problem with artificial limits: setting the maximum number of addresses as user can have to, say 50 hardly would cause a problem. It may be unlikely that a video gets millions of comments, but you don't want to impose an artificial limit on it, right?
  2. Updates do not happen very often. If documents increase in size beyond a certain threshold, they might be moved, since documents are guaranteed to be never fragmented. However, document migrations are expensive and you want to prevent them.
  3. Basically, all operations on comments become more complicated and hence more expensive - a bad choice. KISS!

I have written an article about the above, which describes the respective problems in greater detail.

And furthermore, I do not see any advantage in having the comments with the videos. The questions to answer would be

  1. For a given user, what are the videos?
  2. What are the newest videos (with certain tags)?
  3. For a given video, what are the comments?

Note that the only connection between videos and comments here is about a given video, so you already have the _id or something else to positively identify the video. Furthermore, you don't want to load all comments at once, especially if you have a lot of them, since this would decrease UX because of long load times.

Let's say it is the _id. So, with it, you'd be able to have paged comments easily:

db.comments.find({"video_id": idToFind})
.skip( (page-1) * pageSize )
.limit( pageSize )

hth

Upvotes: 3

Tanveer Dayan
Tanveer Dayan

Reputation: 506

You should use embedded for better performance. Your I/O's will be lesser. In worst case? it might take a bit long to persist the document in the DB but it wont take much time to retrieve it.

You should either compromise persistence over reads or vise versa depending on your application needs.

Hence it is important to choose your db wisely.

Upvotes: 0

Related Questions