Juicy
Juicy

Reputation: 12500

MongoDB aggregation, find number of distinct values in documents' arrays

Reading the docs, I see you can get the number of elements in document arrays. For example given the following documents:

{ "_id" : 1, "item" : "ABC1", "description" : "product 1", colors: [ "blue", "black", "red" ] }
{ "_id" : 2, "item" : "ABC2", "description" : "product 2", colors: [ "purple" ] }
{ "_id" : 3, "item" : "XYZ1", "description" : "product 3", colors: [ ] }

and the following query:

db.inventory.aggregate([{$project: {item: 1, numberOfColors: { $size: "$colors" }}}])

We would get the number of elements in each document's colors array:

{ "_id" : 1, "item" : "ABC1", "numberOfColors" : 3 }
{ "_id" : 2, "item" : "ABC2", "numberOfColors" : 1 }
{ "_id" : 3, "item" : "XYZ1", "numberOfColors" : 0 }

I've not been able to figure out if and how you could sum up all the colors in all the documents directly from a query, ie:

{ "totalColors": 4 }

Upvotes: 2

Views: 1528

Answers (2)

Neil Lunn
Neil Lunn

Reputation: 151092

Infinitely better is is to simply $sum the $size:

db.inventory.aggregate([
  { "$group": { "_id": null, "totalColors": { "$sum": { "$size": "$colors" } } }
])

If you wanted "distinct in each document" then you would instead:

db.inventory.aggregate([
  { "$group": {
    "_id": null,
    "totalColors": {
      "$sum": {
        "$size": { "$setUnion": [ [], "$colors" ] }
      }
    }
  }}
])

Where $setUnion takes values likes ["purple","blue","purple"] and makes it into ["purple","blue"] as a "set" with "distinct items".

And if you really want "distinct across documents" then don't accumulate the "distinct" into a single document. That causes performance issues and simply does not scale to large data sets, and can possibly break the 16MB BSON Limit. Instead accumulate naturally via the key:

db.inventory.aggregate([
  { "$unwind": "$colors" },
  { "$group": { "_id": "$colors" } },
  { "$group": { "_id": null, "totalColors": { "$sum": 1 } } }
])

Where you only use $unwind because you want "distinct" values from the array as combined with other documents. Generally $unwind should be avoided unless the value contained in the array is being accessed in the "grouping key" _id of $group. Where it is not, it is better to treat arrays using other operators instead, since $unwind creates a "copy" of the whole document per array element.

And of course there was also nothing wrong with simply using .distinct() here, which will return the "distinct" values "as an array", for which you can just test the Array.length() on in code:

var totalSize = db.inventory.distinct("colors").length;

Which for the simple operation you are asking, would be the overall fastest approach for a simple "count of distinct elements". Of course the limitation remains that the result cannot exceed the 16MB BSON limit as a payload. Which is where you defer to .aggregate() instead.

Upvotes: 2

Ivan Manzhos
Ivan Manzhos

Reputation: 743

You can use the following query to get the count of all colors in all docs:

db.inventory.aggregate([ 
  { $unwind: '$colors' } , // expands nested array so we have one doc per each array value 
  { $group: {_id: null, allColors: {$addToSet: "$colors"} } }  , // find all colors
  { $project: { totalColors: {$size: "$allColors"}}} // find count of all colors
 ])

Upvotes: 2

Related Questions