Better mongodb data model for nested information

Question

I am designing some data model for mango db, I have some requirement similar to below json.

Single_Collection.

{

"collegeid": 1234,
"Name": "aaaa",
 "otherinfo": 1,

"studnet":[
    {
        "stdid": 1,
        "name": "n1"
    },
    {
        "stdid": 2,
        "name": "n2"
    }
]
}

Two Collections.

College Info

{
"collegeid": 1234,
"Name": "aaaa",
 "otherinfo": 1
}

Student Info collection

    {
    "collegeid": 1234,
    "stdid": 1,
    "name": "n1"
    }

    {
    "collegeid": 1234,
    "stdid": 2,
    "name": "n2"
    }

Which is the better way interms of reading performance(Having single collection or separating it out), I have more read like given student ID find out the college ID . Student ID list will be very big.

Also I perform more student insertion operations

yellowB · Accepted Answer

IMO，each model design has its own Pros & Cons, what we say "better way" is depending on your use cases(How you query the data? Do you need all the data at the beginning? Do you need paging? etc...)

Let's start from your requirements.

Your requirements

Given a college ID, find out the students in this college.
Given student ID, find out his college ID.

Relation between objects

Obviously college & student is 1:m mapping, because a lot of students in one college but each student can stay in one college only.

I will show you some different model designs and also provide the Pros & Cons for each model.

Approach 1 - Embed students into college

This is the design you mentioned as a single collection.

{
   "collegeid":1234,
   "Name":"aaaa",
   "otherinfo":1,
   "studnet":[
      {
         "stdid":1,
         "name":"n1"
      },
      {
         "stdid":2,
         "name":"n2"
      }
   ]
}

Pros：

Very natural model for human to read & front-end to display.
Good performance when loading the college and all the students in it. Because the data stored in the engine is in continuous. The engine needs fewer I/O to do that.

Cons:

If you have huge number of students in a college, the size of the document will be very big. It will be inefficient if you add/remove/update student frequently.
There is not a quick way to achieve requirement(2). Since we only maintain the mapping from college -> students, you have to walk through all the documents to find out which college contains the specified studentID.

Approach 2 - Student has reference to college

This is the design you mentioned as a Two Collections. It is similar to the RDBMS tables, student model owns a reference key point to its college.

College collection:

{
   "collegeid":1234,
   "Name":"aaaa",
   "otherinfo":1
}

Student collection:

{
   "collegeid":1234,
   "stdid":1,
   "name":"n1"
}
{
   "collegeid":1234,
   "stdid":2,
   "name":"n2"
}

Pros：

Can achieve requirement(1) and (2). Remember to add index on "collegeid" and "stdid" field.
Every document can be maintained in small size, it is easy for the engine to store the data.

Cons:

College and students are separated. It will be slower than Approach 1 if loading a college and all its students(Need two queries).
You need to merge college and students together by yourself before displaying in UI.

Approach 3 - Duplicated data in college and students

This approach looks like we mix up approach 1 and approach 2. We have two collections： college will have its students embeded in itself, and also a separated student collection. So, the student data is duplicated in both collections.

College collection:

{
   "collegeid":1234,
   "Name":"aaaa",
   "otherinfo":1,
   "studnet":[         // duplicated here!
      {
         "stdid":1,
         "name":"n1"
      },
      {
         "stdid":2,
         "name":"n2"
      }
   ]
}

Student collection:

{
   "collegeid":1234,
   "stdid":1,
   "name":"n1"
}
{
   "collegeid":1234,
   "stdid":2,
   "name":"n2"
}

Pros：

You have all Pros from Approach 1 & Approach 2.

Cons:

The document in college collection will grow to very big.
You have to keep the data from college collection and student collection Synchronous by yourself.

Approach 4 - Duplicated data in college(Only studentID) and students

This is a variant from Approach 3. We assume that your use case is:

User can search for a college.
User click one college in search result.
A new UI show all student IDs to user(maybe in a grid or list).
User click one student ID.
System loads the full data of the specified student and show to user in another UI.

In a short word, user does not need the full data of all students at the beginning, he just need students' basic info(e.g. student ID). If user accepts such scenario, you can have below model:

College collection:

{
   "collegeid":1234,
   "Name":"aaaa",
   "otherinfo":1,
   "studnetIds":[1, 2]  // only student IDs are duplicated
}

Student collection:

{
   "collegeid":1234,
   "stdid":1,
   "name":"n1"
}
{
   "collegeid":1234,
   "stdid":2,
   "name":"n2"
}

College only has the studnet IDs in it. This is the difference compared to Approach 3.

Pros：

Can achieve requirement(1) and (2).
You don't need to worry about the college document grows to huge size. Since it only owns the student IDs.
If user accepts above scenario, this will be a better desgin on the balance of performance/complex/data size.

Cons:

Suitable to specified use case, if the requirement is changed in the future, will break the scenario and this model will be not good.

Summary

You should be very clear to your use cases.
Based on use cases, compare the approachs to see whether you can accept the Pros & Cons.
Load testing is important!

Better mongodb data model for nested information

Answers (1)

Your requirements

Relation between objects

Approach 1 - Embed students into college

Approach 2 - Student has reference to college

Approach 3 - Duplicated data in college and students

Approach 4 - Duplicated data in college(Only studentID) and students

Summary

Related Questions