Update embedded document in MongoDB: Performance issue?

Question

I am new to MongoDB and have heard that it handles massive numbers of read and write operations well. Embedded documents are one of the features that make this possible, but I am not sure whether they can also cause performance problems. An example book document:

{
    "_id": 1,
    "Authors": [
        {
            "Email": "email",
            "Name": "name"
        }
    ],
    "Title": "title",
    ...
}

If one author has thousands of books and his email needs to be updated, I need to write a query that will:

1. search through all book documents and pick out the thousands by this author
2. update the author's email field across those book documents

These operations do not seem efficient. But this type of update is ubiquitous, so I believe the developers have considered it. Where did I get it wrong?

chridam · Accepted Answer

Your current embedded schema design has its merits, one of them being data locality. Since MongoDB stores each document contiguously on disk, putting all the data you need in one document means spinning disks spend less time seeking to a particular location.

If your application frequently accesses book information together with the authors' data, you'll almost certainly want to go the embedded route. The other advantage of embedded documents is atomicity and isolation when writing data, since a write to a single document is atomic.

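To illustrate, say you want the email field updated on all books by one author. This can be done with a single operation (each matched document is updated atomically, although the operation as a whole is not atomic across documents), and it need not be a performance issue in MongoDB, particularly when the field used in the filter is indexed: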

// Field names are case-sensitive, so they follow the question's document
db.books.updateMany(
    { "Authors.Name": "foo" },
    {
        "$set": { "Authors.$.Email": "new@email.com" }
    }
);

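or, with MongoDB versions before 3.2 (which lack updateMany):

// Field names are case-sensitive, so they follow the question's document
db.books.update(
    { "Authors.Name": "foo" },
    {
        "$set": { "Authors.$.Email": "new@email.com" }
    },
    { "multi": true }
)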

In the above, the positional $ operator identifies the array element to update without explicitly specifying its position in the array, which makes it convenient for updating arrays of embedded documents. It is used with dot notation to reach a field inside the matched element.
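One caveat with the positional $ operator is that it only touches the first array element matching the filter. If the same author could appear more than once in a book's Authors array, the filtered positional operator $[<identifier>] with arrayFilters (available from MongoDB 3.6) updates every matching element. The sketch below is written for mongosh and uses the capitalized field names from the question's document. The helper names and the identifier `a` are illustrative, not part of any MongoDB API, and the index is a suggestion rather than something the original schema is known to have.

```javascript
// Sketch only: `db` is assumed to be a mongosh-style database handle,
// so that db.books resolves to the books collection.

// A multikey index on the embedded field lets the filter locate an
// author's books without scanning the whole collection. Run it once.
function ensureAuthorNameIndex(db) {
  return db.books.createIndex({ "Authors.Name": 1 });
}

function updateAuthorEmail(db, name, newEmail) {
  return db.books.updateMany(
    { "Authors.Name": name },                       // books by this author
    { $set: { "Authors.$[a].Email": newEmail } },   // every element bound to `a`
    { arrayFilters: [{ "a.Name": name }] }          // which elements `a` matches
  );
}

// In mongosh:
//   ensureAuthorNameIndex(db);
//   updateAuthorEmail(db, "foo", "new@email.com");
```

Whether the index pays off depends on how often you filter by author name compared with how often you write to the collection, so treat it as something to measure rather than a given.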

For more details on data modelling in MongoDB, please read the docs Data Modeling Introduction, especially Model One-to-Many Relationships with Embedded Documents.

The other design option you can consider is referencing documents, where you follow a normalized schema. For example:

// db.books schema
{
    "_id": 3,
    "authors": [1, 2, 3], // <-- array of references to the author collection
    "title": "foo"
}

// db.authors schema
/*
1
*/
{
    "_id": 1,    
    "name": "foo",
    "surname": "bar",
    "address": "xxx",
    "email": "foo@mail.com"
}
/*
2
*/
{
    "_id": 2,    
    "name": "abc",
    "surname": "def",
    "address": "xyz",
    "email": "abc@mail.com"
}
/*
3
*/
{
    "_id": 3,    
    "name": "alice",
    "surname": "bob",
    "address": "xyz",
    "email": "alice@mail.com"
}
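
With this layout, the email change from the question touches a single author document, regardless of how many books reference that author. A minimal mongosh-style sketch, assuming the authors collection shown above (`setAuthorEmail` is an illustrative helper name):

```javascript
// Sketch only: `db` is assumed to be a mongosh-style database handle.
function setAuthorEmail(db, authorId, newEmail) {
  // One document is written; the books keep pointing at the same _id.
  return db.authors.updateOne(
    { _id: authorId },
    { $set: { email: newEmail } }
  );
}

// In mongosh: setAuthorEmail(db, 1, "new@email.com");
```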

The normalized schema above, which uses document references, also has an advantage when you have one-to-many relationships with very unpredictable arity. If you have hundreds or thousands of author documents per given book entity, embedding has many drawbacks as far as space constraints are concerned: the larger the document, the more RAM it uses, and MongoDB documents have a hard size limit of 16MB.

For querying a normalized schema, you can consider using the aggregation framework's $lookup stage, which performs a left outer join to the authors collection in the same database, so that each document from the books collection is processed together with its matching author documents.
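
As a rough sketch of such a join for the books and authors collections above, the pipeline could look like the following. It assumes mongosh, an arbitrary output field name `authorDocs`, and a server version (3.4 or later, to my knowledge) in which $lookup matches each element of an array localField against a scalar foreignField without a prior $unwind.

```javascript
// Sketch only: `db` is assumed to be a mongosh-style database handle.
const booksWithAuthorsPipeline = [
  {
    $lookup: {
      from: "authors",        // collection to join with
      localField: "authors",  // array of author _id references in books
      foreignField: "_id",    // field matched in the authors collection
      as: "authorDocs",       // assumed name for the joined array
    },
  },
];

function booksWithAuthors(db) {
  // Each returned book carries its matching author documents in authorDocs.
  return db.books.aggregate(booksWithAuthorsPipeline).toArray();
}

// In mongosh: booksWithAuthors(db);
```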

That said, I believe your current schema is a better approach than creating a separate collection of authors, since separate collections require more work: finding a book and its authors takes two queries, whereas with the embedded schema above reads are easy and fast (single seek). There are no big differences for inserts and updates. So, separate collections are good if you need to select individual documents, need more control over querying, or have huge documents. Embedded documents are good when you want the entire document, the document with a $slice of the embedded authors, or the document with no authors at all.
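
To make the comparison concrete, here is a hedged mongosh-style sketch of both read paths. The helper names are illustrative; the normalized variant assumes the books/authors collections from the referencing example, and the embedded variant assumes the question's document with its capitalized Authors field.

```javascript
// Sketch only: `db` is assumed to be a mongosh-style database handle.

// Normalized schema: one query for the book, a second one for its authors.
function findBookWithAuthors(db, bookId) {
  const book = db.books.findOne({ _id: bookId });
  if (book === null) return null;
  const authors = db.authors.find({ _id: { $in: book.authors } }).toArray();
  return { book, authors };
}

// Embedded schema: a single query, optionally trimming the Authors array.
function findBookEmbedded(db, bookId, maxAuthors) {
  if (maxAuthors === undefined) {
    return db.books.findOne({ _id: bookId });   // entire document
  }
  const projection =
    maxAuthors === 0
      ? { Authors: 0 }                          // leave the authors out
      : { Authors: { $slice: maxAuthors } };    // first `maxAuthors` authors
  return db.books.findOne({ _id: bookId }, projection);
}
```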

The general rule of thumb is that if your application's query pattern is well known and data tends to be accessed in only one way, an embedded approach works well. If your application queries data in many ways, or you are unable to anticipate the query patterns, a more normalized document-referencing model is more appropriate.

Ref:

MongoDB Applied Design Patterns: Practical Use Cases with the Leading NoSQL Database By Rick Copeland
