Nitin Kumar
Nitin Kumar

Reputation: 775

Aggregation in mongodb for nested documents

I have a document in the following format:

"summary":{
     "HUL":{
        "hr_0":{
            "ts":None,
            "Insights":{
                "sentiments":{
                    "pos":37,
                    "neg":3,
                    "neu":27
                    },
                "topics":[
                    "Basketball",
                    "Football"
                    ],
                "geo":{
                    "locations":{
                        "Delhi":34,
                        "Kolkata":56,
                        "Pune":79,
                        "Bangalore":92,
                        "Mumbai":54
                        },
                    "mst_act":{
                        "loc":Bangalore, 
                        "lat_long":None
                        }
                    }
                }
            },
        "hr_1":{....},
        "hr_2":{....},
         .
         .
        "hr_23":{....}

I want to run an aggregation in pymongo that sums up the pos, neg and neu sentiments for all hours of the day "hr_0" to "hr_23".

I am having trouble in constructing the pipeline command in order to do this as the fields I am interested in are in nested dictionaries. Would really appreciate your suggestions.

Thanks!

Upvotes: 2

Views: 598

Answers (1)

chridam
chridam

Reputation: 103455

It's going to be pretty difficult to come up with an aggregation pipeline that will give you the desired aggregates because your document schema has some dynamic keys which you can't use as an identifiey expression in the group operator pipeline. However, a workaround using the current schema would be to iterate over the find cursor and extract the values you want to add up within the loop. Something like the following:

pos_total = 0
neg_total = 0
neu_total = 0

cursor = db.collection.find()

for doc in cursor:          
    for i in range(0, 24):
        pos_total += doc["summary"]["HUL"]["hr_"+str(i)]["Insights"]["sentiments"]["pos"]
        neg_total += doc["summary"]["HUL"]["hr_"+str(i)]["Insights"]["sentiments"]["neg"]
        neu_total += ddoc["summary"]["HUL"]["hr_"+str(i)]["Insights"]["sentiments"]["neu"]

print(pos_total)
print(neg_total)
print(neu_total)

If you could do with changing the schema, then the following schema would be ideal for using the aggregation framework:

{
    "summary": {
        "HUL": [
            {
                "_id": "hr_0",          
                "ts": None,
                "Insights":{
                    "sentiments":{
                        "pos":37,
                        "neg":3,
                        "neu":27
                    },
                    "topics":[
                        "Basketball",
                        "Football"
                    ],
                    "geo":{
                        "locations":{
                            "Delhi":34,
                            "Kolkata":56,
                            "Pune":79,
                            "Bangalore":92,
                            "Mumbai":54
                        },
                        "mst_act":{
                            "loc":Bangalore, 
                            "lat_long":None
                        }
                    }
                }
            },
            {
                "_id": "hr_2",          
                "ts": None,
                "Insights":{
                    "sentiments":{
                        "pos":37,
                        "neg":3,
                        "neu":27
                    },
                    ...
                }
            },
            ...
            {
                "_id": "hr_23",         
                "ts": None,
                "Insights":{
                    "sentiments":{
                        "pos":37,
                        "neg":3,
                        "neu":27
                    },
                    ...
                }
            }
        ]
    }
}

The aggregation pipeline that would give you the required totals is:

var pipeline = [
    {
        "$unwind": "$summary.HUL"
    },
    {
        "$group": {
            "_id": "$summary.HUL._id",
            "pos_total": { "$sum": "$summary.HUL.Insights.sentiments.pos" },
            "neg_total": { "$sum": "$summary.HUL.Insights.sentiments.neg" },
            "neu_total": { "$sum": "$summary.HUL.Insights.sentiments.neu" },
        }
    }
]

result = db.collection.aggregate(pipeline)

Upvotes: 2

Related Questions