Reputation: 842
If I do a count query, I get the results in under 2 seconds:
db.coll.find({"A":1,"createDate":{"$gt":new Date("2011-05-21"),"$lt":new Date("2013-08-21")}}).count()
This uses the following index:
db.coll.ensureIndex({"A":1,"createDate":1})
Similarly, there are four fields A, B, C, and D (values are always 0 or 1) for which I run four count queries and get all the results in under 10 seconds.
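For reference, the four individual counts would look something like the sketch below (field names and the date range are taken from the question; each field is assumed to have an index analogous to the A/createDate one). This needs a live mongo shell to run:

```javascript
// One count per flag field; each can use its own { <field>: 1, createDate: 1 } index.
// The date range mirrors the one in the question.
var range = { "$gt" : new Date("2011-05-21"), "$lt" : new Date("2013-08-21") };
var totalA = db.coll.find({ "A" : 1, "createDate" : range }).count();
var totalB = db.coll.find({ "B" : 1, "createDate" : range }).count();
var totalC = db.coll.find({ "C" : 1, "createDate" : range }).count();
var totalD = db.coll.find({ "D" : 1, "createDate" : range }).count();
```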
I looked at the aggregation framework documentation and created an aggregation query to do all four sums together:
db.coll.aggregate(
    { $match : { "createDate" : { $gt : new Date("2013-05-21"), $lt : new Date("2013-08-21") } } },
    { $group : {
        _id : null,
        totalA : { $sum : "$A" },
        totalB : { $sum : "$B" },
        totalC : { $sum : "$C" },
        totalD : { $sum : "$D" }
    }}
)
I also created an index:
db.coll.ensureIndex({"createDate":1,"A":1,"B":1,"C":1,"D":1})
According to the documentation, this index covers my aggregation. Even so, the aggregate call returns in ~18 seconds.
I'm confused. Did I miss something basic, or is there a fundamental reason why aggregation is slower than count? I'm also concerned about the overhead of firing multiple count queries from the application code just to fetch the counts.
Upvotes: 5
Views: 5933
Reputation:
Short answer: in your case aggregation is slower because it involves more data processing, whereas mongo can calculate counts efficiently using an index. Aggregation is meant for computing complex results (grouping etc.); for simple counting, count() suffices.
The reason for this is that aggregation in mongodb is a framework for aggregating data, based on the concept of data processing pipelines. Mongo logically passes the entire collection into the aggregation pipeline. This is why there is no explain for aggregation as a whole (as of this writing, version 2.4): there is basically one access method, and the rest of the time is spent on processing. It seems there is support for explain in more recent versions, though.
You can do early filtering so that only a subset of the data in a collection enters the pipeline:

Early Filtering

If your aggregation operation requires only a subset of the data in a collection, use the $match, $limit, and $skip stages to restrict the documents that enter at the beginning of the pipeline. When placed at the beginning of a pipeline, $match operations use suitable indexes to scan only the matching documents in a collection.

Placing a $match pipeline stage followed by a $sort stage at the start of the pipeline is logically equivalent to a single query with a sort and can use an index. When possible, place $match operators at the beginning of the pipeline.
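As a concrete illustration of the quoted advice, using the collection and field names from the question (this is a sketch and needs a live shell with appropriate indexes):

```javascript
// $match first, so an index on createDate can limit the scan; the
// following $sort on the same leading index field is, per the docs,
// logically equivalent to a single indexed query with a sort.
db.coll.aggregate(
    { $match : { "createDate" : { $gt : new Date("2013-05-21"), $lt : new Date("2013-08-21") } } },
    { $sort  : { "createDate" : 1 } }
)
```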
Aggregation pipeline behaviour.
Upvotes: 4
Reputation: 151112
Firstly, though not documented for 2.4.8, you can run an explain using the db.runCommand invocation:
db.runCommand({
    aggregate: "coll",
    pipeline: [
        { $match :
            { "createDate" : { $gt : new Date("2013-05-21"), $lt : new Date("2013-08-21") } }
        },
        { $group : {
            _id : null,
            totalA : { $sum : "$A" },
            totalB : { $sum : "$B" },
            totalC : { $sum : "$C" },
            totalD : { $sum : "$D" }
        }}
    ],
    explain: true
})
That will give you some insight into what is happening.
Also, and primarily, you are comparing apples to oranges.
When you issue a count() on a query, it uses the cursor's result properties to get the number of documents that matched.

Under aggregation, you are selecting an extended match and then compacting all of those results into sums of the items. If your initial $match produces a lot of results, then all of them need to be crunched together by $sum.
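To see why the size of the $match result matters, here is a plain-JavaScript sketch (not mongo shell, sample documents invented for illustration) of what the $group stage has to do: every matched document is visited and folded into the running sums, so the work grows with the number of matches:

```javascript
// Matched documents, as the $match stage would hand them to $group.
var matched = [
    { A: 1, B: 0, C: 1, D: 0 },
    { A: 1, B: 1, C: 0, D: 0 },
    { A: 0, B: 1, C: 1, D: 1 }
];

// What { $group: { _id: null, totalA: { $sum: "$A" }, ... } } computes:
// one pass over every matched document, accumulating each sum.
var totals = matched.reduce(function (acc, doc) {
    acc.totalA += doc.A;
    acc.totalB += doc.B;
    acc.totalC += doc.C;
    acc.totalD += doc.D;
    return acc;
}, { totalA: 0, totalB: 0, totalC: 0, totalD: 0 });

// totals is { totalA: 2, totalB: 2, totalC: 2, totalD: 1 }
```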
Have a look at explain, and try to understand the differences conceptually. Aggregation is great for what you generally want it to do, but maybe this isn't the best use case.
Upvotes: 8