Reputation: 18929
A collection may have multiple fields that each contain substantial data, like article content or image data.
When using aggregation pipeline stages, I would assume there would be benefits to trimming down the fields using projection, so we only pass required fields to subsequent stages to help with memory usage.
A trivial example: we have a requirement to find all articles that don't have a matching author in an authors collection. I would assume that we would not project unnecessary article fields onwards. The same with the $lookup to authors, where we only need an _id for this purpose. Demo:
db.getCollection("articles").aggregate([
    {
        // Keep only articles where somecolumn exists and is non-null.
        $match: {
            somecolumn: { $ne: null, $exists: true }
        }
    },
    {
        // Trim the document down to the fields later stages need.
        $project: {
            _id: 1,
            authorId: 1
        }
    },
    {
        // Join to authors on authorId, returning only the author's _id.
        $lookup: {
            from: "authors",
            let: { author: "$authorId" },
            pipeline: [
                { $match: { $expr: { $eq: ["$$author", "$_id"] } } },
                { $project: { _id: 1 } }
            ],
            as: "author"
        }
    },
    {
        // Keep only articles with no matching author.
        $match: {
            "author.0": { $exists: false }
        }
    }
]);
Am I correct in this assumption or do the internal processes work differently?
Upvotes: 2
Views: 1656
Reputation: 42352
In general, the only place you want to have $project is as the very last stage, in order to return only the needed fields back to the client, in some cases renaming or recomputing them.
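For instance, a final shaping stage might look like this (a sketch only; field names such as title, content, and authorId are illustrative, not from the question):
db.getCollection("articles").aggregate([
    // ... earlier pipeline stages ...
    {
        // Final $project: return only what the client needs,
        // renaming and recomputing fields where useful.
        $project: {
            _id: 0,
            title: 1,
            author: "$authorId",                          // rename authorId -> author
            preview: { $substrCP: ["$content", 0, 100] }  // recompute: first 100 code points
        }
    }
]);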
You do not need to have $project earlier in the pipeline to "trim down the fields" because the aggregation already does dependency analysis and only gets the fields that are needed in the pipeline.
Here is an example, shown via explain:
db.foo.explain().aggregate([{$group:{_id:"$fieldA", count:{$sum:1}}}])
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"fields" : {
"fieldA" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "snv.foo",
"indexFilterSet" : false,
"parsedQuery" : {
},
"winningPlan" : {
"stage" : "EOF"
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$fieldA",
"count" : {
"$sum" : {
"$const" : 1
}
}
}
}
]
}
Even though I have no projection, you can see that only fieldA is being returned to the rest of the pipeline.
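The same check can be run against the question's pipeline with its $project stages removed; the fields document of the $cursor stage should still list only the fields the pipeline actually uses. A sketch, using the collections from the question:
db.getCollection("articles").explain().aggregate([
    { $match: { somecolumn: { $ne: null, $exists: true } } },
    {
        $lookup: {
            from: "authors",
            let: { author: "$authorId" },
            pipeline: [
                { $match: { $expr: { $eq: ["$$author", "$_id"] } } }
            ],
            as: "author"
        }
    },
    { $match: { "author.0": { $exists: false } } }
]);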
The only scenario where it's necessary to add an early $project stage is to work around a bug or limitation in aggregation's own dependency analysis, and it should not be done routinely.
Upvotes: 3
Reputation: 14317
When using aggregation pipeline stages, I would assume there would be benefits to trimming down the fields using projection, so we only pass required fields to subsequent stages to help with memory usage.
Yes, this is correct. Each aggregation stage has a memory limit, and staying within this limit helps avoid performance issues.
From the manual (pipeline memory restrictions):
Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB will produce an error. To allow for the handling of large datasets, use the allowDiskUse option to enable aggregation pipeline stages to write data to temporary files.
When the query uses allowDiskUse, it affects performance, as using disk is much slower than memory.
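For example (publishedAt is just an illustrative field; the option goes in the second argument to aggregate):
db.getCollection("articles").aggregate(
    [
        // An unindexed sort is a typical stage that can exceed the 100 MB limit.
        { $sort: { publishedAt: 1 } }
    ],
    { allowDiskUse: true }  // let stages spill to temporary files instead of erroring
);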
In addition, here are some important practices:
- Trim down the fields passed to later stages using $project stages.
- $match and $sort can use indexes, but only in the initial stages. Also, $match and $limit can reduce the number of documents to process down the pipeline if used early in the pipeline.
- Use explain and study the query plans to find whether indexes are being used (in $match and $sort) or whether any indexes need defining for query optimization; a sketch follows this list. Note that indexes work a little differently with aggregations, and not all stages can use indexes.
- The aggregation framework can automatically re-order some stages for optimization. For further details see this documentation on aggregation pipeline optimization.
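A minimal sketch of that check (somecolumn is the field from the question; the index itself is just illustrative):
// Define an index supporting the initial $match.
db.getCollection("articles").createIndex({ somecolumn: 1 });

// Inspect the query plan for the pipeline's initial $match.
db.getCollection("articles").explain("executionStats").aggregate([
    { $match: { somecolumn: { $ne: null, $exists: true } } }
]);
// In the output, an IXSCAN in the $cursor stage's winning plan
// indicates the index is being used; a COLLSCAN means it is not.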
Upvotes: -1