Reputation: 7998
I'm running the standard Homebrew installation of Mongo DB, version 2.4.6, and I've got a database with a collection called 'items', which has 600k documents within it.
I've written the following query to find the the top five brands for the collection of items:
db.items.aggregate([
{ $group: { _id: '$brand', size: { $sum: 1}}},
{ $sort: {"size": -1}},
{ $limit: 5}
])
which returns the result I expected, but to be frank, takes much longer to complete than I ever would have imagined. Here is the profile data:
{
"op" : "command",
"ns" : "insights-development.$cmd",
"command" : {
"aggregate" : "items",
"pipeline" : [
{
"$group" : {
"_id" : "$brand",
"size" : {
"$sum" : 1
}
}
},
{
"$sort" : {
"size" : -1
}
},
{
"$limit" : 5
}
]
},
"ntoreturn" : 1,
"keyUpdates" : 0,
"numYield" : 3,
"lockStats" : {
"timeLockedMicros" : {
"r" : NumberLong(3581974),
"w" : NumberLong(0)
},
"timeAcquiringMicros" : {
"r" : NumberLong(1314151),
"w" : NumberLong(10)
}
},
"responseLength" : 267,
"millis" : 2275,
"ts" : ISODate("2013-11-23T18:16:33.886Z"),
"client" : "127.0.0.1",
"allUsers" : [ ],
"user" : ""
}
Here is the ouptut of db.items.stats()
:
{
"sharded" : false,
"primary" : "a59aff30810b066bbe31d1fae79596af",
"ns" : "insights-development.items",
"count" : 640590,
"size" : 454491840,
"avgObjSize" : 709.4894394230319,
"storageSize" : 576061440,
"numExtents" : 14,
"nindexes" : 10,
"lastExtentSize" : 156225536,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 165923744,
"indexSizes" : {
"_id_" : 17889088,
"demographic_1" : 14741328,
"brand_1" : 17946320,
"retailer_1" : 18690336,
"color_1" : 15738800,
"style_1" : 18951968,
"classification_1" : 15019312,
"placement_1" : 19107312,
"state_1" : 12394816,
"gender_1" : 15444464
},
"ok" : 1
}
I'm fairly new to MongoDB so I'm hoping someone can point out why this aggregation takes so long to run and if there is anything I can do to speed it up as it seems to me that 600k isn't a huge number of documents more Mongo to run calculations on.
Upvotes: 1
Views: 699
Reputation: 42352
If you have an index on "brand" field, then adding a {$sort:{brand:1}}
at the beginning of the pipeline may help performance. The reason you're not seeing good performance right now is likely due to the need to scan every document to group by brand. If there was an index, then it could be used to scan index only rather than all the documents. And sorting (which uses an index) can speed up grouping in some cases where having a result ordered by the field being grouped is beneficial.
If you created an index on brand
and didn't see any improvement, try adding a $sort
before you get rid of the index. If it happens that you already have an index where brand
is the first field, you then don't need to add another index on brand
- the compound index will automatically be used.
Upvotes: 2