Sarath Nair
Sarath Nair

Reputation: 2868

Speed up MongoDB aggregation

I have a sharded collection "my_collection" with the following structure:

{ 
   "CREATED_DATE" : ISODate(...),
   "MESSAGE" : "Test Message",
   "LOG_TYPE": "EVENT"
}

The mongoDB environment is sharded with 2 shards. The above collection is sharded using Hashed shard key on LOG_TYPE. There are 7 more other possibilities for LOG_TYPE attribute.

I have 1 million documents in "my_collection" and I am trying to find the count of documents based on the LOG_TYPE using the following query:

db.my_collection.aggregate([
    { "$group" :{ 
        "_id": "$LOG_TYPE",
        "COUNT": { "$sum":1 }
    }}
])

But this is getting me result in about 3 seconds. Is there any way to improve this? Also when I run the explain command, it shows that no Index has been used. Does the group command doesn't use an Index?

Upvotes: 3

Views: 10912

Answers (2)

Asya Kamsky
Asya Kamsky

Reputation: 42352

There are currently some limitations in what aggregation framework can do to improve the performance of your query, but you can help it the following way:

db.my_collection.aggregate([
    { "$sort" : { "LOG_TYPE" : 1 } },
    { "$group" :{ 
        "_id": "$LOG_TYPE",
        "COUNT": { "$sum":1 }
    }}
])

By adding a sort on LOG_TYPE you will be "forcing" the optimizer to use an index on LOG_TYPE to get the documents in order. This will improve the performance in several ways, but differently depending on the version being used.

On real data if you have the data coming into the $group stage sorted, it will improve the efficiency of accumulation of the totals. You can see the different query plans where with $sort it will use the shard key index. The improvement this gives in actual performance will depend on the number of values in each "bucket" - in general LOG_TYPE having only seven distinct values makes it an extremely poor shard key, but it does mean that it all likelihood the following code will be a lot faster than even optimized aggregation:

db.my_collection.distinct("LOG_TYPE").forEach(function(lt) {
   print(db.my_collection.count({"LOG_TYPE":lt});
});

Upvotes: 8

Sammaye
Sammaye

Reputation: 43884

There are a limited number of things that you can do in MongoDB, at the end of the day this might be a physical problem that extends beyond MongoDB itself, maybe latency causing configsrvs to respond untimely or results to be brought back from shards too slowly.

However you might be able to solve some performane problems by using a covered query. Since you are in fact sharding on LOG_TYPE you will already have an index on it (required before you can shard on it), not only that but the aggregation framework will auto add projection so that won't help.

MongoDB is likely having to communicate to every shard for the results, otherwise called a scatter and gather operation.

$group on its own will not use an index.

This is my results on 2.4.9:

> db.t.ensureIndex({log_type:1})
> db.t.runCommand("aggregate", {pipeline: [{$group:{_id:'$log_type'}}], explain: true})
{
        "serverPipeline" : [
                {
                        "query" : {

                        },
                        "projection" : {
                                "log_type" : 1,
                                "_id" : 0
                        },
                        "cursor" : {
                                "cursor" : "BasicCursor",
                                "isMultiKey" : false,
                                "n" : 1,
                                "nscannedObjects" : 1,
                                "nscanned" : 1,
                                "nscannedObjectsAllPlans" : 1,
                                "nscannedAllPlans" : 1,
                                "scanAndOrder" : false,
                                "indexOnly" : false,
                                "nYields" : 0,
                                "nChunkSkips" : 0,
                                "millis" : 0,
                                "indexBounds" : {

                                },
                                "allPlans" : [
                                        {
                                                "cursor" : "BasicCursor",
                                                "n" : 1,
                                                "nscannedObjects" : 1,
                                                "nscanned" : 1,
                                                "indexBounds" : {

                                                }
                                        }
                                ],
                                "server" : "ubuntu:27017"
                        }
                },
                {
                        "$group" : {
                                "_id" : "$log_type"
                        }
                }
        ],
        "ok" : 1
}

This is the result from 2.6:

> use gthtg
switched to db gthtg
> db.t.insert({log_type:"fdf"})
WriteResult({ "nInserted" : 1 })
> db.t.ensureIndex({log_type: 1})
{ "numIndexesBefore" : 2, "note" : "all indexes already exist", "ok" : 1 }
> db.t.runCommand("aggregate", {pipeline: [{$group:{_id:'$log_type'}}], explain: true})
{
        "stages" : [
                {
                        "$cursor" : {
                                "query" : {

                                },
                                "fields" : {
                                        "log_type" : 1,
                                        "_id" : 0
                                },
                                "plan" : {
                                        "cursor" : "BasicCursor",
                                        "isMultiKey" : false,
                                        "scanAndOrder" : false,
                                        "allPlans" : [
                                                {
                                                        "cursor" : "BasicCursor",
                                                        "isMultiKey" : false,
                                                        "scanAndOrder" : false
                                                }
                                        ]
                                }
                        }
                },
                {
                        "$group" : {
                                "_id" : "$log_type"
                        }
                }
        ],
        "ok" : 1
}

Upvotes: 0

Related Questions