Pratik Bothra
Pratik Bothra

Reputation: 2694

Aggregation in mongodb by searching for range of keys in embedded object

Mongodb site, https://docs.mongodb.org/ecosystem/use-cases/pre-aggregated-reports/, offers this example of a document for faster look up at daily, monthly level.

{
    _id: "201010/site-1/apache_pb.gif",
    metadata: {
        date: ISODate("2000-10-00T00:00:00Z"),
        site: "site-1",
        page: "/apache_pb.gif" },
    daily: {
        "1": {"sessions": 300, "bounces": 10}
        "2": {"sessions": 100, "bounces": 5},
        "3": {"sessions": 10},
        "4": {"sessions": 100, "bounces": 4},
        ... }
}

For example, to retrieve data for a specific day

   db.stats.monthly.find_one({ },{ 'daily.1': 1 'metadata': 1})

The above schema works great for me as well for most use cases as documented, as it is essentially just lookups.

For the few cases that we might have custom date range is something I'm struggling at, so if the user searches for 1st Jan - 3nd Jan => I'm ideally expecting this result.

[{
        _id: "201010/site-1/apache_pb.gif",
        metadata: {
            date: ISODate("2000-10-00T00:00:00Z"),
            site: "site-1",
            page: "/apache_pb.gif" },
        result: {
           "sessions": 410, "bounces": 15 }
    }, {

}, {
        _id: "201010/site-1/apache_new.gif",
        metadata: {
            date: ISODate("2000-05-00T00:00:00Z"),
            site: "site-1",
            page: "/apache_new.gif" },
        result: {
           "sessions": 310, "bounces": 8 }
    }, {

}...
]

I understand that we need to do aggregation here, but totally confused if it is even possible to aggregation by giving a range for keys in the embedded object.

Will I have to restructure my schema for this to be possible? I really love the efficient lookups, and they serve 80-90% of our use case.

Upvotes: 0

Views: 794

Answers (1)

Blakes Seven
Blakes Seven

Reputation: 50416

Your current data storage format does not play well with the aggregation framework or indeed MongoDB queries in general. The core problem is your "daily" object just contains named keys for each item. This means that in order to access anything, MongoDB needs to be given the specific path to that key. Such as "daily.1", just as you mention.

As stated, the aggregation framework and general MongoDB operations cannot "traverse the keys of an object", so you would need server side JavaScript in order to collect data from all the keys.

An approach that is more in line with optimised MongoDB capabilties, is to store the "daily" data in an array instead:

{
    _id: "201010/site-1/apache_pb.gif",
    metadata: {
        date: ISODate("2000-10-00T00:00:00Z"),
        site: "site-1",
        page: "/apache_pb.gif"
    },
    daily: [
        { "day": 1, "sessions": 300, "bounces": 10},
        { "day": 2, "sessions": 100, "bounces": 5},
        { "day": 3, "sessions": 10},
        { "day": 4, "sessions": 100, "bounces": 4}
    ]
}

Then you can run an aggregation over the content quite simply:

db.colllection.aggregate([
    // Match relevant objects 
    { "$match": {
        "daily": { 
            "$elemMatch": { 
                "day": { "$gte": 1, "$lte": 3 } 
            }
        }
    }},

    // Unwind to denormalize array
    { "$unwind": "$daily" },

    // Filter the required results
    { "$match": {
        "daily.day": { "$gte": 1, "$lte": 3 }
    }},
    // Group data and sum totals
    { "$group": {
        "_id": "$_id",
        "metadata": { "$first": "$metadata" },
        "resultSessions": { "$sum": "$daily.sessions" },
        "resultBounces": { "$sum": "$daily.bounces" }
    }},

    // Optionally project to desired format
    { "$project": {
        "metadata": 1,
        "result": {
            "sessions": "$resultSessions",
            "bounces": "$resultBounces"
        }
    }}
])

Or better yet, pre-filter the array before unwinding:

db.colllection.aggregate([
    { "$match": {
        "daily": { 
            "$elemMatch": { 
                "day": { "$gte": 1, "$lte": 3 } 
            }
        }
    }},
    { "$project": {
        "metadata": 1,
        "daily": {
            "$setDifference": [
                { "$map": {
                    "input": "$daily",
                    "as": "day",
                    "in": {
                        "$cond": [
                            { "$and": [
                                { "$gte": [ "$day.day", 1 ] },
                                { "$lte": [ "$day.day", 3 ] }
                            ]},
                            "$day",
                            false
                        ]
                    }
                }},
                [false]
            ]
        }
    }},
    { "$unwind": "$daily" },
    { "$group": {
        "_id": "$_id",
        "metadata": { "$first": "$metadata" },
        "resultSessions": { "$sum": "$daily.sessions" },
        "resultBounces": { "$sum": "$daily.bounces" }
    }},
    { "$project": {
        "metadata": 1,
        "result": {
            "sessions": "$resultSessions",
            "bounces": "$resultBounces"
        }
    }}
])

And please always $match relevant objects first to reduce what is being processed.

As the properties in the data now share all the same paths and are not tied to an outer key they can now be easily accumulated.

Without this structural change, the only way to aggregate on the server is to use mapReduce, which can use a coded function to iterate the object keys:

db.collection.mapReduce(
    function() {
        var result = { "sessions": 0, "bounces": 0 };
        Object.keys(this.daily)
            .filter(function(key) {
               return ( key >= 1 && key <= 3 );
            })
            .forEach(function(key) {
                result.sessions += this.daily[key].sessions;
                result.bounces += this.daily[key].bounces;
            });
        emit(this._id,{ metadata: this.metadata, result: result });
   },
   function() {},  // won't be called for unique _id values
   { 
       "out": { "inline": 1 },
       "query": {
           "daily": { 
               "$elemMatch": { 
                   "day": { "$gte": 1, "$lte": 3 } 
               }
       }
    }},


   }
)

Of course in both cases adjusting the grouping depending on whether you intend to accumulate across documents or not.

And of course if you are not actually accumulating across documents at all, then just to the same type of key traversal in your own client receiving code.

Upvotes: 3

Related Questions