Prata
Prata

Reputation: 1320

Mongo indexing fields for optimizing aggregation of big data

{
    "_id" : ObjectId("59b7d232cb4ddc345c3bcef4"),
    "user_device_id" : ObjectId("59b7d21017c8a62f3a40c0bf"),
    "generated_at" : ISODate("2017-09-12T12:24:48.182Z"),
    "game_id" : ObjectId("59b683d4dd30770001513c75"),
    "device_type" : "iPhone 8 android sdk",
    "session_duration": 1000, /* in milliseconds */
    "device_os" : "iOS android sdk",
    "event_type" : "level_increase" /* new_user, game_session, ....*/
} // events collection

{
    "_id" : ObjectId("59b7d21017c8a62f3a40c0bf"),
    "generated_at" : ISODate("2017-09-12T12:24:48.182Z"),
    "game_id" : ObjectId("59b683d4dd30770001513c75"),
    "device_type" : "iPhone 8 android sdk",
    "device_os" : "iOS android sdk"
} // user_devices collection

I have a 'events' collection which has the similar above structure. I am having mongo as a big data analytics platform so to get the optimum speed for aggregation query in milliseconds or if not then in couple of seconds which fields should I index any idea. The total amount of documents will probably be around 10-100s of billions.

Aggregation query will be basically between two dates which are user retention per game, average user session per game, total user per game, query based on device_type or device_os per game.

ATM I have date field indexed.

Here is the example of a Day 1 user retention query:

UserDevice.aggregate(
        [
            {$match: {generated_at: {$gte: first_date, $lt: end_date}, game_id: "some game_id"}},
            {
                $lookup: {
                    from: "events",
                    localField: "_id",
                    foreignField: "user_device_id",
                    as: "event_docs"
                }
            },
            {
                $group: {
                    _id: { day: { $dayOfMonth: {$add: ["$generated_at", 1000*3600*24*1]}}, month: {$month: {$add: ["$generated_at", 1000*3600*24*1]}},  year: { $year: {$add: ["$generated_at", 1000*3600*24*1]} } }, total_users: {$sum: 1},
                    returned_users: {
                        $sum: {
                            $cond: { if: { $eq: [
                                    {
                                        $filter: {
                                            input: "$event_docs",
                                            as: "ed",
                                            cond: {
                                                $and: [
                                                    {$eq: [{ $dayOfMonth: {$add: ["$generated_at", 1000*3600*24*1]}}, { $dayOfMonth: "$$ed.generated_at" }]},
                                                    {$eq: [{ $month: {$add: ["$generated_at", 1000*3600*24*1]}}, { $month: "$$ed.generated_at" }]},
                                                    {$eq: [{ $year: {$add: ["$generated_at", 1000*3600*24*1]}}, { $year: "$$ed.generated_at" }]},
                                                    {$ne: ["$$ed.event_type", "new_user"]}
                                                ]

                                            }
                                        }
                                    }, []
                                ]}, then: 0, else: 1
                            }
                        }
                    }
                }
            }, {
                $sort: {"_id.year": 1, "_id.month": 1, "_id.day": 1}
            }
            ]).exec(function(err, results) {
                if (err) throw err;
                var latency = Date.now() - startTime;

                console.log("RETENTION RESULTS", JSON.stringify(results), "| latency:", latency,"ms");
            });

Mongo Version: 3.4.7

Upvotes: 0

Views: 192

Answers (1)

vvg
vvg

Reputation: 6385

First of all indexes are used in aggregation in full only from MongoDB 3.2.

Changed in version 3.2: Starting in MongoDB 3.2, indexes can cover an aggregation pipeline. In MongoDB 2.6 and 3.0, indexes could not cover an aggregation pipeline since even when the pipeline uses an index, aggregation still requires access to the actual documents. https://docs.mongodb.com/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes

Make sure you have this version or newer. In general, you can create indexes for all fields that are in $match section of your queries.

I would also suggest go through ‘Aggregation Pipeline Optimization’ section of mongodb docs. It might be helpful For ex. you can use explain option to see how query is executed and understand if it can be optimized.

Upvotes: 1

Related Questions