Reputation: 447
I have a query:
db.test.aggregate( {$group : { _id : '$key', frequency: { $sum : 1 } } } )
This returns the frequency of every distinct value of key in the test collection. In other words, it gives me the distribution of key.
Now imagine I want to get the distributions of key1, key2, and key3 (so three different distributions).
Obviously, I could run this query three times, once per key, but it seems like it should be possible to optimize this by counting all three keys in a single pass. I have been experimenting and searching the web, but so far I am resigned to either running three separate aggregation queries or using a map/reduce function.
Does anyone have any other ideas?
Upvotes: 9
Views: 7955
Reputation: 1164
There are a few different approaches you could use here:
Use map/reduce: don't do this. Right now it would be much faster to run the aggregation framework 3 times than to use a map reduce function for this use case.
Run the aggregation 3 times. This is not optimal, but if you don't have time constraints then this is the easiest option. If each aggregation takes less than a few seconds anyway, I wouldn't worry about optimizing until it becomes a problem.
Here's the best workaround I can think of. The $group operator allows you to build an _id on multiple fields, e.g. {"_id":{"a":"$key1", "b":"$key2", "c":"$key3"}}. Doing this creates one group for every existing combination of values of your keys. You could group your keys this way and then manually sum across the results in the client.
Let me elaborate. Let's say we have a collection of shapes. Each shape has a color, a size, and a kind (square, circle, etc.), stored here in the fields f1, f2, and f3 respectively. An aggregation on a multi-key _id could look like:
db.shapes.aggregate({$group:{_id:{"f1":"$f1", "f2":"$f2", "f3":"$f3"}, count:{"$sum":1}}})
and return:
"result" : [
{
"_id" : {
"f1" : "yellow",
"f2" : "medium",
"f3" : "triangle"
},
"count" : 4086
},
{
"_id" : {
"f1" : "red",
"f2" : "small",
"f3" : "triangle"
},
"count" : 4138
},
{
"_id" : {
"f1" : "red",
"f2" : "big",
"f3" : "square"
},
"count" : 4113
},
{
"_id" : {
"f1" : "yellow",
"f2" : "small",
"f3" : "triangle"
},
"count" : 4145
},
{
"_id" : {
"f1" : "red",
"f2" : "small",
"f3" : "square"
},
"count" : 4062
}
... and so on
You would then sum up the results client-side, over a drastically reduced number of entries. Assuming the number of unique values for each key is sufficiently small compared to the total number of documents, you could do this final step in a negligible amount of time.
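As a rough sketch of that client-side step, assuming the grouped documents have already been pulled into a plain array, something like the following would collapse them into one distribution per field. The helper name marginalCounts is made up for illustration (it is not part of MongoDB), and the sample array contains only the five groups shown above, not the full result.

```javascript
// Hypothetical helper: collapse multi-key $group output into one
// distribution per field by summing counts over the other fields.
function marginalCounts(groups) {
  const totals = Object.create(null);
  for (const group of groups) {
    // group._id looks like { f1: "yellow", f2: "medium", f3: "triangle" }
    for (const [field, value] of Object.entries(group._id)) {
      if (!totals[field]) totals[field] = Object.create(null);
      totals[field][value] = (totals[field][value] || 0) + group.count;
    }
  }
  return totals;
}

// Only the five groups listed above; the real result has more entries.
const sample = [
  { _id: { f1: "yellow", f2: "medium", f3: "triangle" }, count: 4086 },
  { _id: { f1: "red", f2: "small", f3: "triangle" }, count: 4138 },
  { _id: { f1: "red", f2: "big", f3: "square" }, count: 4113 },
  { _id: { f1: "yellow", f2: "small", f3: "triangle" }, count: 4145 },
  { _id: { f1: "red", f2: "small", f3: "square" }, count: 4062 },
];

const distributions = marginalCounts(sample);
// distributions.f1 is the color distribution, distributions.f2 the
// size distribution, and distributions.f3 the kind distribution.
```

For these five groups alone, distributions.f1 comes out as yellow: 8231 and red: 12313. The loop touches each group once per field, so the cost depends on the number of distinct combinations rather than on the number of documents.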
Upvotes: 6