Dmitrii Sidenko
Dmitrii Sidenko

Reputation: 658

How to make pymongo aggregation with count all elements and grouping by one request

I have a collection with fields like this:

{
    "_id":"5cf54857bbc85fd0ff5640ba",
    "book_id":"5cf172220fb516f706d00591",
    "tags":{
        "person":[
            {"start_match":209, "length_match":6, "word":"kimmel"}
        ],
        "organization":[
            {"start_match":107, "length_match":12, "word":"philadelphia"},
            {"start_match":209, "length_match":13, "word":"kimmel center"}
        ],
        "location":[
            {"start_match":107, "length_match":12, "word":"philadelphia"}
        ]
    },
    "deleted":false
}

I want to collect the different words in the categories and count it. So, the output should be like this:

{
    "response": [
        {
            "tag": "location",
            "tag_list": [
                {
                    "count": 31,
                    "phrase": "philadelphia"
                },
                {
                    "count": 15,
                    "phrase": "usa"
                }
             ]
        },
        {
            "tag": "organization",
            "tag_list": [ ... ]
        },
        {
            "tag": "person",
            "tag_list": [ ... ]
        },
    ]
}

The pipeline like this works:

def pipeline_func(tag):
    return [
        {'$replaceRoot': {'newRoot': '$tags'}},
        {'$unwind': '${}'.format(tag)},
        {'$group': {'_id': '${}.word'.format(tag), 'count': {'$sum': 1}}},
        {'$project': {'phrase': '$_id', 'count': 1, '_id': 0}},
        {'$sort': {'count': -1}}
    ]

But it make a request for each tag. I want to know how to make it in one request. Thank you for attention.

Upvotes: 1

Views: 1619

Answers (1)

Neil Lunn
Neil Lunn

Reputation: 151092

As noted, there is a slight mismatch in the question data to the current claimed pipeline process since $unwind can only be used on arrays and the tags as presented in the question is not an array.

For the data presented in the question you basically want a pipeline like this:

db.collection.aggregate([
  { "$addFields": {
    "tags": { "$objectToArray": "$tags" }
  }},
  { "$unwind": "$tags" },
  { "$unwind": "$tags.v" },
  { "$group": {
    "_id": {
      "tag": "$tags.k",
      "phrase": "$tags.v.word"
    },
    "count": { "$sum": 1 }
  }},
  { "$group": {
    "_id": "$_id.tag",
    "tag_list": {
      "$push": {
        "count": "$count",
        "phrase": "$_id.phrase"
      }
    }
  }}
])

Again as per the note, since tags is in fact an object then what you actually need in order to collect data based on it's sub-keys as the question is asking, is to turn that essentially into an array of items.

The usage of $replaceRoot in your current pipeline would seem to indicate that $objectToArray is of fair use here, as it is available from later patch releases of MongoDB 3.4, being the bare minimal version you should be using in production right now.

That $objectToArray actually does pretty much what the name says and produces an array ( or "list" to be more pythonic ) of entries broken into key and value pairs. These are essentially a "list" of objects ( or "dict" entries ) which have the keys k and v respectively. The output of the first pipeline stage would look like this on the supplied document:

{
  "book_id": "5cf172220fb516f706d00591",
  "tags": [
    {
      "k": "person",
      "v": [
        {
          "start_match": 209,
          "length_match": 6,
          "word": "kimmel"
        }
      ]
    }, {
      "k": "organization",
      "v": [
        {
          "start_match": 107,
          "length_match": 12,
          "word": "philadelphia"
        }, {
          "start_match": 209,
          "length_match": 13,
          "word": "kimmel center"
        }
      ]
    }, {
      "k": "location",
      "v": [
        {
          "start_match": 107,
          "length_match": 12,
          "word": "philadelphia"
        }
      ]
    }
  ],
  "deleted" : false
}

So you should be able to see how you can now easily access those k values and use them in grouping, and of course the v is the standard array as well. So it's just the two $unwind stages as shown and then two $group stages. Being the first $group in order to collection over the combination of keys, and the second to collect as per the main grouping key whilst adding the other accumulations to a "list" within that entry.

Of course the output by the above listing is not exactly how you asked for in the question, but the data is basically there. You can optionally add an $addFields or $project stage to essentially rename the _id key as the final aggregation stage:

  { "$addFields": {
    "_id": "$$REMOVE",
    "tag": "$_id"
  }}

Or simply do something pythonic with a little list comprehension on the cursor output:

cursor = db.collection.aggregate([
  { "$addFields": {
    "tags": { "$objectToArray": "$tags" }
  }},
  { "$unwind": "$tags" },
  { "$unwind": "$tags.v" },
  { "$group": {
    "_id": {
      "tag": "$tags.k",
      "phrase": "$tags.v.word"
    },
    "count": { "$sum": 1 }
  }},
  { "$group": {
    "_id": "$_id.tag",
    "tag_list": {
      "$push": {
        "count": "$count",
        "phrase": "$_id.phrase"
      }
    }
  }}
])

output = [{ 'tag': doc['_id'], 'tag_list': doc['tag_list'] } for doc in cursor]

print({ 'response': output });

And final output as a "list" you can use for response:

{
  "tag_list": [
    {
      "count": 1,
      "phrase": "philadelphia"
    }
  ],
  "tag": "location"
},
{
  "tag_list": [
    {
      "count": 1,
      "phrase": "kimmel"
    }
  ],
  "tag": "person"
},
{
  "tag_list": [
    {
      "count": 1,
      "phrase": "kimmel center"
    }, {
      "count": 1,
      "phrase": "philadelphia"
    }
  ],
  "tag": "organization"
}

Noting that using a list comprehension approach you have a bit more control over the order of "keys" as output, as MongoDB itself would simply append NEW key names in a projection keeping existing keys ordered first. If that sort of thing is important to you that is. Though it really should not be since all Object/Dict like structures should not be considered to have any set order of keys. That's what arrays ( or lists ) are for.

Upvotes: 1

Related Questions