MongoDB (noSQL) when to split collections

Question

So I'm writing an application in NodeJS & ExpressJS. It's my first time I'm using a noSQL database like MongoDB and I'm trying to figure out how to fix my data model.

At start for our project we have written down everything in relationship database terms but since we recently switched from Laravel to ExpressJS for our project I'm a bit stuck on what to do with all my different tables layouts.

So far I have figured out it's better to denormalize your scheme but it does have to end somewhere, right? In the end you can end up storing your whole data in one collection. Well, not enterily but you get the point.

1. So is there a rule or standard that defines where to cut to make multiple collections? I'm having a relation database with users (which are both a client or a store user), stores, products, purchases, categories, subcategories ..

2. Is it bad to define a relationship in a noSQL database? Like every product has a category but I want to relate to the category by an id (parent does the job in MongoDB) but is it a bad thing? Or is this where you choose performance vs structure?

3. Is noSQL/MongoDB ment to be used for such large databases which have much relationships (if they were made in MySQL)?

Thanks in advance

Markus W Mahlberg · Accepted Answer

As already written, there are no rules like the second normal form for SQL.

However, there are some best practices and common pitfalls related to optimization for MongoDB which I will list here.

Overuse of embedding

The BSON limit

Contrary to popular believe, there is nothing wrong with references. Assume you have a library of books, and you want to track the rentals. You could begin with a model like this

{
  // We use ISBN for its uniqueness 
  _id: "9783453031456"
  title: "Schismatrix",
  author: "Bruce Sterling",
  rentals: [
    {
      name:"Markus Mahlberg,
      start:"2015-05-05T03:22:00Z",
      due:"2015-05-12T12:00:00Z"
    }
  ]
}

While there are several problems with this model, the most important isn't obvious – there will be a limited number of rentals because of the fact that BSON documents have a size limit of 16MB.

The document migration problem

The other problem with storing rentals in an array would be that this would cause relatively frequent document migrations, which is a rather costly operation. BSON documents are never partitioned and created with some additional space allocated in advance used when they grow. This additional space is called padding. When the padding is exceeded, the document is moved to another location in the datafiles and new padding space is allocated. So frequent additions of data cause frequent document migrations. Hence, it is best practice to prevent frequent updates increasing the size of the document and use references instead.

So for the example, we would change our single model and create a second one. First, the model for the book

{
  _id: "9783453031456",
  title:"Schismatrix",
  author: "Bruce Sterling"
}

The second model for the rental would look like this

{
  _id: new ObjectId(),
  book: "9783453031456",
  rentee: "Markus Mahlberg",
  start: ISODate("2015-05-05T03:22:00Z"),
  due: ISODate("2015-05-05T12:00:00Z"),
  returned: ISODate("2015-05-05T11:59:59.999Z")
}

The same approach of course could be used for author or rentee.

The problem with over normalization

Let's look back some time. A developer would identify the entities involved into a business case, define their properties and relations, write the according entity classes, bang his head against the wall for a few hours to get the triple inner-outer-above-and-beyond JOIN working required for the use case and all lived happily ever after. So why use NoSQL in general and MongoDB in particular? Because nobody lived happily ever after. This approach scales horribly and almost exclusively the only way to scale is vertical.

But the main difference of NoSQL is that you model your data according to the questions you need to get answered.

That being said, let's look at a typical n:m relation and take the relation from authors to books as our example. In SQL, you'd have 3 tables: two for your entities (books and authors) and one for the relation (Who is the author of which book?). Of course, you could take those tables and create their equivalent collections. But, since there are no JOINs in MongoDB, you'd need three queries (one for the first entity, one for its relations and one for the related entities) to find the related documents of an entity. This wouldn't make sense, since the three table approach for n:m relations was specifically invented to overcome the strict schemas SQL databases enforce. Since MongoDB has a flexible schema, the first question would be where to store the relation, keeping the problems arising from overuse of embedding in mind. Since an author might write quite a few books in the years coming, but the authorship of a book rarely, if at all, changes, the answer is simple: We store the authors as a reference to the authors in the books data

{
  _id: "9783453526723",
  title: "The Difference Engine",
  authors: ["idOfBruceSterling","idOfWilliamGibson"]
}

And now we can find the authors of that book by doing two queries:

var book = db.books.findOne({title:"The Difference Engine"})
var authors = db.authors.find({_id: {$in: book.authors})

I hope the above helps you to decide when to actually "split" your collections and to get around the most common pitfalls.

Conclusion

As to your questions, here are my answers

As written before: No, but keeping the technical limitations in mind should give you an idea when it could make sense.
It is not bad – as long as it fits your use case(s). If you have a given category and its _id, it is easy to find the related products. When loading the product, you can easily get the categories it belongs to, even efficiently so, as _id is indexed by default.
I have yet to find a use case which can't be done with MongoDB, though some things can get a bit more complicated with MongoDB. What you should do imho is to take the sum of your functional and non functional requirements and check wether the advantages outweigh the disadvantages. My rule of thumb: if one of "scalability" or "high availability/automatic failover" is on your list of requirements, MongoDB is worth more than a look.