crazydiv
crazydiv

Reputation: 842

Understanding performance: mongo aggregation vs count

If I do a count query, I get the results in <2seconds

db.coll.find({"A":1,"createDate":{"$gt":new Date("2011-05-21"),"$lt":new Date("2013-08-21")}}).count()

This uses the following index

db.coll.ensureIndex({"A":1,"createDate":1})

Similarly, there are 4 columns A,B,C,D(values are always 0 or 1) for which I run 4 count queries and get results in <10seconds.

I looked at the aggregation framework documentation and created an aggregated query to do all 4 sums together.

db.coll.aggregate(  { $match : {"createDate":{$gt:new Date("2013-05-21"),$lt:new Date("2013-08-21")} } },
{ $group :
                         { _id:null,
                         totalA : { $sum : "$A" },
                         totalB : {$sum: "$B},
                         totalC:{$sum: "$C"},
                         totalD:{$sum: "$D"}}} 
 ) 

I also created an index:

db.coll..ensureIndex({"createDate":1,"A":1,"B":1,"C":1,"D":1})

According to the documentation, this index covers my aggregate function. But the return of the aggregate is in ~18seconds.

I'm confused here. Is there anything basic which I missed or is there any fundamental concept lying behind which makes aggregation slower than count. I am also concerned about the overhead due to number of queries to be fired from mongo from the code for fetching count.

Upvotes: 5

Views: 5933

Answers (2)

user508434
user508434

Reputation:

Short answer: in your case aggregation is slower because it involves more data processing while using an index mongo efficiently can calculate counts. Aggregation is meant for computing some complex results (grouping etc) while for simple counting count() suffices.

The reason for this is that aggregation in mongodb is a framework for aggregating data and is based on the concept of the data processing pipelines. Mongo logically passes the entire collection into the aggregation pipeline. This is the reason there is no explain for aggregation as a whole (as of this writing, version 2.4). This means there is basically one access method and the rest of the time is used for processing. But it seems there is a support for explain in recent versions.

You can do an early filtering in which subset of the data in a collection is used.

Early Filtering

If your aggregation operation requires only a subset of the data in a collection, use the $match, $limit, and $skip stages to restrict the documents that enter at the beginning of the pipeline. When placed at the beginning of a pipeline, $match operations use suitable indexes to scan only the matching documents in a collection.

Placing a $match pipeline stage followed by a $sort stage at the start of the pipeline is logically equivalent to a single query with a sort and can use an index. When possible, place $match operators at the beginning of the pipeline.

Aggregation pipeline behaviour.

Upvotes: 4

Neil Lunn
Neil Lunn

Reputation: 151112

Firstly, though not documented for 2.4.8 you can run an explain using the db.runCommand invocation:

db.runCommand({
    aggregate: "coll",
    pipeline: [      
        { $match : 
            {"createDate":{$gt:new Date("2013-05-21"),$lt:new Date("2013-08-21")} } 
        },
        { $group : { 
              _id:null,
              totalA: {$sum :"$A"},
              totalB: {$sum: "$B"},
              totalC: {$sum: "$C"},
              totalD: {$sum: "$D"}
        }} 
    ],
    explain: true
})

Which will give you some insight into what is happening.

Also, and primarily, you are comparing apples to oranges.

When you issue a count() on a query, it is using the cursor result properties to get the number of documents that matched.

Under aggregation, you are selecting an extended match and then compacting all of those results into a sum of all the items. If your initial $match has lots of results, then all of these need to be crunched together with $sum.

Have a look at explain, and try to conceptually understand the differences. Aggregation is great for what you generally want it to do. But maybe this isn't the best use case.

Upvotes: 8

Related Questions