Samuel Goldenbaum

Reputation: 18929

Are there memory performance benefits in using projections between stages in MongoDB's aggregation pipeline?

A collection may have documents with multiple fields that each contain substantial data, like article content or image data.

When using aggregation pipeline stages, I would assume there are benefits to trimming down the fields with a projection, so that only the required fields are passed to subsequent stages, helping with memory usage.

A trivial example: we have a requirement to find all articles that don't have a matching author in an authors collection. I would assume we should not project unnecessary article fields onwards. The same goes for the $lookup into authors, where we only need an id for this purpose. Demo:

db.getCollection("articles").aggregate(

    [
        {
            $match: {
                somecolumn: { "$ne": null, $exists: true }
            }
        },

        {
            $project: { 
                id: 1,
                authorId: 1
            }
        },

        {
            $lookup: {
                      from: "authors",
                      let: { author: "$authorId" },
                      pipeline: [
                        {
                          $match: {
                              $expr:
                                {
                                    $eq: ["$$author","$id"] }
                              }

                        },
                        { $project: { id: 1, } }
                      ],
                      as: "author"
                    }
        },

        {
            $match: {
                "author.0": {$exists: false}
            }
        }
    ]
);

Am I correct in this assumption or do the internal processes work differently?

Upvotes: 2

Views: 1656

Answers (2)

Asya Kamsky

Reputation: 42352

In general, the only place you want a $project is as the very last stage, in order to return only the needed fields to the client, in some cases renaming or recomputing them.

You do not need to have $project earlier in the pipeline to "trim down the fields" because the aggregation already does dependency analysis and only gets the fields that are needed in the pipeline.

Here is an example, shown via explain:

db.foo.explain().aggregate({$group:{_id:"$fieldA", count:{$sum:1}}})
{
  "stages" : [
    {
        "$cursor" : {
            "query" : {

            },
            "fields" : {
                "fieldA" : 1,
                "_id" : 0
            },
            "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "snv.foo",
                "indexFilterSet" : false,
                "parsedQuery" : {

                },
                "winningPlan" : {
                    "stage" : "EOF"
                },
                "rejectedPlans" : [ ]
            }
        }
    },
    {
        "$group" : {
            "_id" : "$fieldA",
            "count" : {
                "$sum" : {
                    "$const" : 1
                }
            }
        }
    }
  ]
}

Even though I have no projection, you can see that only fieldA is being returned to the rest of the pipeline.

The only scenario where it's necessary to add an early $project stage is to work around a bug or limitation in aggregation's own dependency analysis; it should not be done routinely.
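If you did hit such a limitation, the workaround would look something like the following; a minimal sketch reusing the foo/fieldA names from the explain example above:

db.foo.aggregate([
    // Explicit early $project -- a workaround only, not routine practice,
    // since dependency analysis normally fetches only the needed fields.
    { $project: { fieldA: 1, _id: 0 } },
    { $group: { _id: "$fieldA", count: { $sum: 1 } } }
]);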

Upvotes: 3

prasad_

Reputation: 14317

When using aggregation pipeline stages, I would assume there are benefits to trimming down the fields with a projection, so that only the required fields are passed to subsequent stages, helping with memory usage.

Yes, this is correct. Each aggregation stage has a memory limit, and staying within this limit helps avoid performance issues.

From the manual (pipeline memory restrictions):

Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB will produce an error. To allow for the handling of large datasets, use the allowDiskUse option to enable aggregation pipeline stages to write data to temporary files.

When a query uses allowDiskUse, performance suffers, since disk access is much slower than memory.
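For reference, allowDiskUse is passed as an option in the second argument to aggregate(). A minimal sketch (the publishedAt field is hypothetical, not from the question):

db.getCollection("articles").aggregate(
    [
        // A large sort with no supporting index may exceed the 100 MB stage limit
        { $sort: { publishedAt: 1 } }
    ],
    { allowDiskUse: true }  // permit spilling to temporary files on disk
);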

In addition, here are some important practices:

  • Use as few stages as possible; each additional stage means the documents must be examined again, which costs extra processing and resources.
  • Avoid unnecessary $project stages.
  • Stage placement matters: $match and $sort can use indexes only when they appear at the beginning of the pipeline. Also, $match and $limit reduce the number of documents to process down the pipeline when used early (see the sketch after this list).
  • Use explain and study the query plans to see whether indexes are being used (in $match and $sort) or whether new indexes need defining for query optimization. Note that indexes work a little differently with aggregations, and not all stages can use them.
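As a sketch of the index point above (the authorId value is hypothetical): with an index in place and $match as the first stage, explain() shows an IXSCAN in the winning plan of the initial $cursor stage.

db.getCollection("articles").createIndex({ authorId: 1 });

db.getCollection("articles").explain().aggregate([
    { $match: { authorId: "a1" } },  // first stage: eligible to use the index
    { $group: { _id: "$authorId", count: { $sum: 1 } } }
]);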

Also, the aggregation framework can automatically reorder some stages for optimization. For further details, see the documentation on aggregation pipeline optimization.
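As a sketch of one documented optimization (status and publishedAt are hypothetical field names), a $match written after a $sort is moved ahead of it by the optimizer, so fewer documents need to be sorted:

db.getCollection("articles").aggregate([
    { $sort: { publishedAt: -1 } },
    { $match: { status: "published" } }  // the optimizer runs this before the $sort
]);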

Upvotes: -1
