Mongo indexing fields for optimizing aggregation of big data

Question

{
    "_id" : ObjectId("59b7d232cb4ddc345c3bcef4"),
    "user_device_id" : ObjectId("59b7d21017c8a62f3a40c0bf"),
    "generated_at" : ISODate("2017-09-12T12:24:48.182Z"),
    "game_id" : ObjectId("59b683d4dd30770001513c75"),
    "device_type" : "iPhone 8 android sdk",
    "session_duration": 1000, /* in milliseconds */
    "device_os" : "iOS android sdk",
    "event_type" : "level_increase" /* new_user, game_session, ....*/
} // events collection

{
    "_id" : ObjectId("59b7d21017c8a62f3a40c0bf"),
    "generated_at" : ISODate("2017-09-12T12:24:48.182Z"),
    "game_id" : ObjectId("59b683d4dd30770001513c75"),
    "device_type" : "iPhone 8 android sdk",
    "device_os" : "iOS android sdk"
} // user_devices collection

I have a 'events' collection which has the similar above structure. I am having mongo as a big data analytics platform so to get the optimum speed for aggregation query in milliseconds or if not then in couple of seconds which fields should I index any idea. The total amount of documents will probably be around 10-100s of billions.

Aggregation query will be basically between two dates which are user retention per game, average user session per game, total user per game, query based on device_type or device_os per game.

ATM I have date field indexed.

Here is the example of a Day 1 user retention query:

UserDevice.aggregate(
        [
            {$match: {generated_at: {$gte: first_date, $lt: end_date}, game_id: "some game_id"}},
            {
                $lookup: {
                    from: "events",
                    localField: "_id",
                    foreignField: "user_device_id",
                    as: "event_docs"
                }
            },
            {
                $group: {
                    _id: { day: { $dayOfMonth: {$add: ["$generated_at", 1000*3600*24*1]}}, month: {$month: {$add: ["$generated_at", 1000*3600*24*1]}},  year: { $year: {$add: ["$generated_at", 1000*3600*24*1]} } }, total_users: {$sum: 1},
                    returned_users: {
                        $sum: {
                            $cond: { if: { $eq: [
                                    {
                                        $filter: {
                                            input: "$event_docs",
                                            as: "ed",
                                            cond: {
                                                $and: [
                                                    {$eq: [{ $dayOfMonth: {$add: ["$generated_at", 1000*3600*24*1]}}, { $dayOfMonth: "$$ed.generated_at" }]},
                                                    {$eq: [{ $month: {$add: ["$generated_at", 1000*3600*24*1]}}, { $month: "$$ed.generated_at" }]},
                                                    {$eq: [{ $year: {$add: ["$generated_at", 1000*3600*24*1]}}, { $year: "$$ed.generated_at" }]},
                                                    {$ne: ["$$ed.event_type", "new_user"]}
                                                ]

                                            }
                                        }
                                    }, []
                                ]}, then: 0, else: 1
                            }
                        }
                    }
                }
            }, {
                $sort: {"_id.year": 1, "_id.month": 1, "_id.day": 1}
            }
            ]).exec(function(err, results) {
                if (err) throw err;
                var latency = Date.now() - startTime;

                console.log("RETENTION RESULTS", JSON.stringify(results), "| latency:", latency,"ms");
            });

Mongo Version: 3.4.7

vvg · Accepted Answer

First of all indexes are used in aggregation in full only from MongoDB 3.2.

Changed in version 3.2: Starting in MongoDB 3.2, indexes can cover an aggregation pipeline. In MongoDB 2.6 and 3.0, indexes could not cover an aggregation pipeline since even when the pipeline uses an index, aggregation still requires access to the actual documents. https://docs.mongodb.com/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes

Make sure you have this version or newer. In general, you can create indexes for all fields that are in $match section of your queries.

I would also suggest go through ‘Aggregation Pipeline Optimization’ section of mongodb docs. It might be helpful For ex. you can use explain option to see how query is executed and understand if it can be optimized.

Mongo indexing fields for optimizing aggregation of big data

Answers (1)

Related Questions