Jesse
Jesse

Reputation: 562

MongoDB (noSQL) when to split collections

So I'm writing an application in NodeJS & ExpressJS. It's my first time I'm using a noSQL database like MongoDB and I'm trying to figure out how to fix my data model.

At start for our project we have written down everything in relationship database terms but since we recently switched from Laravel to ExpressJS for our project I'm a bit stuck on what to do with all my different tables layouts.

So far I have figured out it's better to denormalize your scheme but it does have to end somewhere, right? In the end you can end up storing your whole data in one collection. Well, not enterily but you get the point.

1. So is there a rule or standard that defines where to cut to make multiple collections? I'm having a relation database with users (which are both a client or a store user), stores, products, purchases, categories, subcategories ..

2. Is it bad to define a relationship in a noSQL database? Like every product has a category but I want to relate to the category by an id (parent does the job in MongoDB) but is it a bad thing? Or is this where you choose performance vs structure?

3. Is noSQL/MongoDB ment to be used for such large databases which have much relationships (if they were made in MySQL)?

Thanks in advance

Upvotes: 13

Views: 5472

Answers (2)

Markus W Mahlberg
Markus W Mahlberg

Reputation: 20703

As already written, there are no rules like the second normal form for SQL.

However, there are some best practices and common pitfalls related to optimization for MongoDB which I will list here.

Overuse of embedding

The BSON limit

Contrary to popular believe, there is nothing wrong with references. Assume you have a library of books, and you want to track the rentals. You could begin with a model like this

{
  // We use ISBN for its uniqueness 
  _id: "9783453031456"
  title: "Schismatrix",
  author: "Bruce Sterling",
  rentals: [
    {
      name:"Markus Mahlberg,
      start:"2015-05-05T03:22:00Z",
      due:"2015-05-12T12:00:00Z"
    }
  ]
}

While there are several problems with this model, the most important isn't obvious – there will be a limited number of rentals because of the fact that BSON documents have a size limit of 16MB.

The document migration problem

The other problem with storing rentals in an array would be that this would cause relatively frequent document migrations, which is a rather costly operation. BSON documents are never partitioned and created with some additional space allocated in advance used when they grow. This additional space is called padding. When the padding is exceeded, the document is moved to another location in the datafiles and new padding space is allocated. So frequent additions of data cause frequent document migrations. Hence, it is best practice to prevent frequent updates increasing the size of the document and use references instead.

So for the example, we would change our single model and create a second one. First, the model for the book

{
  _id: "9783453031456",
  title:"Schismatrix",
  author: "Bruce Sterling"
}

The second model for the rental would look like this

{
  _id: new ObjectId(),
  book: "9783453031456",
  rentee: "Markus Mahlberg",
  start: ISODate("2015-05-05T03:22:00Z"),
  due: ISODate("2015-05-05T12:00:00Z"),
  returned: ISODate("2015-05-05T11:59:59.999Z")
}

The same approach of course could be used for author or rentee.

The problem with over normalization

Let's look back some time. A developer would identify the entities involved into a business case, define their properties and relations, write the according entity classes, bang his head against the wall for a few hours to get the triple inner-outer-above-and-beyond JOIN working required for the use case and all lived happily ever after. So why use NoSQL in general and MongoDB in particular? Because nobody lived happily ever after. This approach scales horribly and almost exclusively the only way to scale is vertical.

But the main difference of NoSQL is that you model your data according to the questions you need to get answered.

That being said, let's look at a typical n:m relation and take the relation from authors to books as our example. In SQL, you'd have 3 tables: two for your entities (books and authors) and one for the relation (Who is the author of which book?). Of course, you could take those tables and create their equivalent collections. But, since there are no JOINs in MongoDB, you'd need three queries (one for the first entity, one for its relations and one for the related entities) to find the related documents of an entity. This wouldn't make sense, since the three table approach for n:m relations was specifically invented to overcome the strict schemas SQL databases enforce. Since MongoDB has a flexible schema, the first question would be where to store the relation, keeping the problems arising from overuse of embedding in mind. Since an author might write quite a few books in the years coming, but the authorship of a book rarely, if at all, changes, the answer is simple: We store the authors as a reference to the authors in the books data

{
  _id: "9783453526723",
  title: "The Difference Engine",
  authors: ["idOfBruceSterling","idOfWilliamGibson"]
}

And now we can find the authors of that book by doing two queries:

var book = db.books.findOne({title:"The Difference Engine"})
var authors = db.authors.find({_id: {$in: book.authors})

I hope the above helps you to decide when to actually "split" your collections and to get around the most common pitfalls.

Conclusion

As to your questions, here are my answers

  1. As written before: No, but keeping the technical limitations in mind should give you an idea when it could make sense.
  2. It is not bad – as long as it fits your use case(s). If you have a given category and its _id, it is easy to find the related products. When loading the product, you can easily get the categories it belongs to, even efficiently so, as _id is indexed by default.
  3. I have yet to find a use case which can't be done with MongoDB, though some things can get a bit more complicated with MongoDB. What you should do imho is to take the sum of your functional and non functional requirements and check wether the advantages outweigh the disadvantages. My rule of thumb: if one of "scalability" or "high availability/automatic failover" is on your list of requirements, MongoDB is worth more than a look.

Upvotes: 19

user3561036
user3561036

Reputation:

The very "first" thing to consider when choosing an "NoSQL" solution for storage over an "Relational" solution is that things "do not work in the same way" and therefore respond differently by design.

More specifically, solutions such as MongoDB are "not meant" to "emulate" the "relational join" structure that is present in many SQL and therefore "relational" backends, and that they are moreover intended to look at data "joins" in a very different way.

This arrives at your "questions" as follows:

  1. There really is no set "rule", and understand that the "rules" of denormalization do not apply here for the basic reason of why NoSQL solutions exist. And that is to offer something "different" that may work well for your situation.

  2. Is it bad? Is it Good? Both are subjective. Considering point "1" here, there is the basic consideration that "non-relational" or "NoSQL" databases are designed to do things "differently" than a relational system is. So therefore there is usually a "penalty" to "emulating joins" in a relational manner. Specifically for MongoDB this means "additional requests". But that does not mean you "cannot" or "should not" do that. Rather it is all about how your usage pattern works for your application.

  3. Re-capping on the basic points made above, NoSQL in general is designed to solve problems that do not suit the traditional SQL and/or "relational" design pattern, and therefore replace them with something else. The "ultimate goal" here is for you to "rethink your data access patterns" and evolve your application to use a storage model that is more suited to how you access it in your application usage.

In short, there are no strict rules, and that is also part of the point in moving away from "nth-normal-form" rules. NoSQL solutions such as MongoDB allow for "nested structure" storage that typical SQL/Relational solutions do not provide in an efficient form.

Another side of this is considering that operations such as "joins" do not "scale" well over "big data" forms, therefore there exists the different way to "join" by offering concepts such as "embedded data structures", such as MongoDB does.

You would do well to real some guides on the subjects of how many NoSQL solutions approach storing and accessing data. This is ultimately what you need to decide on to determine which is best for you and your application.

At the end of the day, it should be about realising when a SQL/Relational model does not meet your needs, and then choosing something else.

Upvotes: 5

Related Questions