Mongodb - grouping several fields using aggregation framework

Question

I have some documents

{name: 'apple',        type: 'fruit',  color: 'red'}
{name: 'banana',       type: 'fruit',  color: 'yellow'}
{name: 'orange',       type: 'fruit',  color: 'orange'}
{name: 'eggplant',     type: 'vege',   color: 'purple'}
{name: 'brocoli',      type: 'vege',   color: 'green'}
{name: 'rose',         type: 'flower', color: 'red'}
{name: 'cauli',        type: 'vege',   color: 'white'}
{name: 'potato',       type: 'vege',   color: 'brown'}
{name: 'onion',        type: 'vege',   color: 'white'}
{name: 'strawberry',   type: 'fruit',  color: 'red'}
{name: 'cashew',       type: 'nut',    color: ''}
{name: 'almond',       type: 'nut',    color: ''}
{name: 'lemon',        type: 'vege',   color: 'yellow'}
{name: 'tomato',       type: 'vege',   color: 'red'}
{name: 'tomato',       type: 'fruit',  color: 'red'}
{name: 'fig',          type: 'fruit',  color: 'pink'}
{name: 'nectarin',     type: 'fruit',  color: 'pink'}

I want to group them into alphabets like below

{
 _id:'a',
 name:['apple','almond'],
 type:[],
 color:[]
}

{
 _id:'b',
 name:['banana','brocoli'],
 type:[],
 color:['brown']
}

...

{
 _id:'f',
 name:['fig'],
 type:['fruit','flower'],
 color:['']
}

...

{
 _id:'n',
 name:['nectarin'],
 type:['nut'],
 color:['']
}

...

{
 _id:'p',
 name:['potato'],
 type:[''],
 color:['pink','purple']
}

...

The result can be saved into another collection. So I can issue a query in the newly created collection: find({_id:'a'}) to return name, type and color begins with the letter 'a'.

I have thought about using $group

$group: {
  _id: $substr: ['$name', 0, 1],
  name: {$addToSet: '$name'},
}

Then another command

$group: {
  _id: $substr: ['$type', 0, 1],
  name: {$addToSet: '$type'},
}

And

$group: {
  _id: $substr: ['$color', 0, 1],
  name: {$addToSet: '$color'},
}

But I am stuck at how to unify all three together to save into a new collection. Or is aggregation framework not suitable for this kind of data summary?

In a real world example, e.g. a e-commerce site, the front page displays something like: "currently we have 135636 products under 231 categories from 111 brands". Surely, these numbers should be cached somewhere (in memory or in another collection), because running $group each time is resource intensive? What would be the optimal schema/design for these situations?

Sorry, my questions are a bit 'confusing'.

Blakes Seven · Accepted Answer

Since you have multiple arrays here, the key thing is to "merge" them all into one for the simplest processing.

The $map operator of the aggregation framework works well here, as well as transforming elements so that you get your "first letter" from each word within the data:

db.alpha.aggregate([
  { "$project": {
    "list": {
      "$map": { 
        "input": [ "A", "B", "C" ],
        "as": "el",
        "in": {
          "$cond": [
            { "$eq": [ "$$el", "A" ] },
            { 
              "type": { "$literal": "name" }, 
              "value": "$name",
              "alpha": { "$substr": [ "$name",0,1 ] } 
            },
            { "$cond": [
              { "$eq": [ "$$el", "B" ] },
              {
                "type": { "$literal": "type" }, 
                "value": "$type",
                "alpha": { "$substr": [ "$type",0,1 ] } 
              },
              {
                "type": { "$literal": "color" }, 
                "value": "$color",
                "alpha": { "$substr": [ "$color",0,1 ] } 
              }
            ]}
          ]
        }
      }
    }
  }},
  { "$unwind": "$list" },
  { "$match": { "list.alpha": { "$ne": "" } } },
  { "$group": {
    "_id": "$list.alpha",
    "list": { 
      "$addToSet": "$list"
    }
  }},
  { "$project": {
    "name": { 
      "$setDifference": [
        { "$map": {
          "input": "$list",
          "as": "el",
          "in": {
            "$cond": [
              { "$eq": [ "$$el.type", "name" ] },
              "$$el.value",
              false
            ]
          }
        }},
        [false]
      ]
    },
    "type": { 
      "$setDifference": [
        { "$map": {
          "input": "$list",
          "as": "el",
          "in": {
            "$cond": [
              { "$eq": [ "$$el.type", "type" ] },
              "$$el.value",
              false
            ]
          }
        }},
        [false]
      ]
    },
    "color": { 
      "$setDifference": [
        { "$map": {
          "input": "$list",
          "as": "el",
          "in": {
            "$cond": [
              { "$eq": [ "$$el.type", "color" ] },
              "$$el.value",
              false
            ]
          }
        }},
        [false]
      ]
    }
  }},
  { "$sort": { "_id": 1 } }
])

If you look at the data in "stages" it makes a lot of sense what is happening here in the tranformation.

The first stage "maps" all of the fields into a single array per document, so all documents now look like this:

{
    "_id" : ObjectId("55df0652c9064ef625d7f36e"),
    "list" : [
            {
                    "type" : "name",
                    "value" : "nectarin",
                    "alpha" : "n"
            },
            {
                    "type" : "type",
                    "value" : "fruit",
                    "alpha" : "f"
            },
            {
                    "type" : "color",
                    "value" : "pink",
                    "alpha" : "p"
            }
    ]
}

The $unwind is of little consequence, as it does the standard and creates new documents from each member. It is the $group that does most of the work here with this result per "alpha" in the grouping:

{
    "_id" : "o",
    "list" : [
            {
                    "type" : "name",
                    "value" : "orange",
                    "alpha" : "o"
            },
            {
                    "type" : "color",
                    "value" : "orange",
                    "alpha" : "o"
            },
            {
                    "type" : "name",
                    "value" : "onion",
                    "alpha" : "o"
            }
    ]
}

That has a nice grouping, and is arguably a decent output format. But in order to get to the end results then the $map operator is employed again alongside $setDifference which can be used to remove the false values where each field "type" conversion does not match the required output field.

The full result is:

{ "_id" : "a", "name" : [ "almond", "apple" ], "type" : [ ], "color" : [ ] }
{ "_id" : "b", "name" : [ "brocoli", "banana" ], "type" : [ ], "color" : [ "brown" ] }
{ "_id" : "c", "name" : [ "cashew", "cauli" ], "type" : [ ], "color" : [ ] }
{ "_id" : "e", "name" : [ "eggplant" ], "type" : [ ], "color" : [ ] }
{ "_id" : "f", "name" : [ "fig" ], "type" : [ "flower", "fruit" ], "color" : [ ] }
{ "_id" : "g", "name" : [ ], "type" : [ ], "color" : [ "green" ] }
{ "_id" : "l", "name" : [ "lemon" ], "type" : [ ], "color" : [ ] }
{ "_id" : "n", "name" : [ "nectarin" ], "type" : [ "nut" ], "color" : [ ] }
{ "_id" : "o", "name" : [ "onion", "orange" ], "type" : [ ], "color" : [ "orange" ] }
{ "_id" : "p", "name" : [ "potato" ], "type" : [ ], "color" : [ "pink", "purple" ] }
{ "_id" : "r", "name" : [ "rose" ], "type" : [ ], "color" : [ "red" ] }
{ "_id" : "s", "name" : [ "strawberry" ], "type" : [ ], "color" : [ ] }
{ "_id" : "t", "name" : [ "tomato" ], "type" : [ ], "color" : [ ] }
{ "_id" : "v", "name" : [ ], "type" : [ "vege" ], "color" : [ ] }
{ "_id" : "w", "name" : [ ], "type" : [ ], "color" : [ "white" ] }
{ "_id" : "y", "name" : [ ], "type" : [ ], "color" : [ "yellow" ] }

Where everything is grouped alphabetically and with their own arrays for each field.

Upcoming releases of MongoDB will have a $filter that makes the $map and $setDifference combination a bit nicer. But that does not make "sets", not that it matters much to this process as long as $addToSet is employed where it is.

Thinking about this, I would like to "advise" that considering the amount of data you want to process here that the resulting "arrays" for each letter might just possibly exceed the BSON limits depending on how many distinct "words" there actually are.

In which case the "advice" here would be follow the process right up to and including the $match, but then only $group afterwards like this:

  { "$group": {
    "_id": { 
      "alpha": "$list.alpha",
      "type": "$list.type",
      "value": "$list.value",
    }
  }},
  { "$sort": { "_id": 1 } }

It's longer output of course, but will not exceed the BSON limit for documents at any stage.

Mongodb - grouping several fields using aggregation framework

Answers (2)

Related Questions