Samuel Goldenbaum

Reputation: 18929

Are there memory performance benefits in using projections between stages in MongoDB's aggregation pipeline?

A collection may have documents with multiple fields that each contain substantial data, like article content or image data.

When using aggregation pipeline stages, I would assume there are benefits to trimming down the fields with a projection, so that only the required fields are passed to subsequent stages, helping with memory usage.

A trivial example: we have a requirement to find all articles that don't have a matching author in an authors collection. I would assume we should not project unnecessary article fields onwards. The same goes for the $lookup into authors, where we only need an id for this purpose. Demo:

db.getCollection("articles").aggregate(

    [
        {
            $match: {
                somecolumn: { "$ne": null, $exists: true }
            }
        },

        {
            $project: { 
                id: 1,
                authorId: 1
            }
        },

        {
            $lookup: {
                      from: "authors",
                      let: { author: "$authorId" },
                      pipeline: [
                        {
                          $match: {
                              $expr:
                                {
                                    $eq: ["$$author","$id"] }
                              }

                        },
                        { $project: { id: 1, } }
                      ],
                      as: "author"
                    }
        },

        {
            $match: {
                "author.0": {$exists: false}
            }
        }
    ]
);

Am I correct in this assumption or do the internal processes work differently?

Upvotes: 2

Views: 1656

Answers (2)

Asya Kamsky

Reputation: 42352

In general, the only place you want a $project is as the very last stage, in order to return only the needed fields to the client, in some cases renaming or recomputing them.

You do not need to have $project earlier in the pipeline to "trim down the fields" because the aggregation already does dependency analysis and only gets the fields that are needed in the pipeline.

Here is an example, shown via explain:

db.foo.explain().aggregate({$group:{_id:"$fieldA", count:{$sum:1}}})
{
  "stages" : [
    {
        "$cursor" : {
            "query" : {

            },
            "fields" : {
                "fieldA" : 1,
                "_id" : 0
            },
            "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "snv.foo",
                "indexFilterSet" : false,
                "parsedQuery" : {

                },
                "winningPlan" : {
                    "stage" : "EOF"
                },
                "rejectedPlans" : [ ]
            }
        }
    },
    {
        "$group" : {
            "_id" : "$fieldA",
            "count" : {
                "$sum" : {
                    "$const" : 1
                }
            }
        }
    }
  ]
}

Even though I have no projection, you can see that only fieldA is being returned to the rest of the pipeline.

The only scenario where it's necessary to add an early $project stage is to work around a bug or limitation in aggregation's own dependency analysis; it should not be done routinely.
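If you did hit such a limitation, the workaround would look something like the following; a minimal sketch reusing the foo/fieldA names from the explain example above:

db.foo.aggregate([
    // Explicit early $project -- a workaround only, not routine practice,
    // since dependency analysis normally fetches only the needed fields.
    { $project: { fieldA: 1, _id: 0 } },
    { $group: { _id: "$fieldA", count: { $sum: 1 } } }
]);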

Upvotes: 3

prasad_

Reputation: 14317

When using aggregation pipeline stages, I would assume there are benefits to trimming down the fields with a projection, so that only the required fields are passed to subsequent stages, helping with memory usage.

Yes, this is correct. Each aggregation stage has a memory limit, and staying within this limit helps avoid performance issues.

From the manual (pipeline memory restrictions):

Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB will produce an error. To allow for the handling of large datasets, use the allowDiskUse option to enable aggregation pipeline stages to write data to temporary files.

When a query uses allowDiskUse, performance suffers, since disk access is much slower than memory.
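For reference, allowDiskUse is passed as an option in the second argument to aggregate(). A minimal sketch (the publishedAt field is hypothetical, not from the question):

db.getCollection("articles").aggregate(
    [
        // A large sort with no supporting index may exceed the 100 MB stage limit
        { $sort: { publishedAt: 1 } }
    ],
    { allowDiskUse: true }  // permit spilling to temporary files on disk
);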

In addition, here are some important practices:

  • Use as few stages as possible; each additional stage means the documents must be examined again, which costs extra processing and resources.
  • Avoid unnecessary $project stages.
  • Stage placement matters: $match and $sort can use indexes only when they appear at the beginning of the pipeline. Also, $match and $limit reduce the number of documents to process down the pipeline when used early (see the sketch after this list).
  • Use explain and study the query plans to see whether indexes are being used (in $match and $sort) or whether new indexes need defining for query optimization. Note that indexes work a little differently with aggregations, and not all stages can use them.
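As a sketch of the index point above (the authorId value is hypothetical): with an index in place and $match as the first stage, explain() shows an IXSCAN in the winning plan of the initial $cursor stage.

db.getCollection("articles").createIndex({ authorId: 1 });

db.getCollection("articles").explain().aggregate([
    { $match: { authorId: "a1" } },  // first stage: eligible to use the index
    { $group: { _id: "$authorId", count: { $sum: 1 } } }
]);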

Also, the aggregation framework can automatically reorder some stages for optimization. For further details, see the documentation on aggregation pipeline optimization.
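As a sketch of one documented optimization (status and publishedAt are hypothetical field names), a $match written after a $sort is moved ahead of it by the optimizer, so fewer documents need to be sorted:

db.getCollection("articles").aggregate([
    { $sort: { publishedAt: -1 } },
    { $match: { status: "published" } }  // the optimizer runs this before the $sort
]);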

Upvotes: -1
