knowbody
knowbody

Reputation: 8276

time series and aggregation framework (mongo)

I'm trying to synchronise two functions I run in my app. First one checks the count of the documents I save to MongoDB every time block (e.g. every 10 seconds) in the real time:

var getVolume = function(timeBlock, cb) {
    var triggerTime = Date.now();
    var blockPeriod = triggerTime - timeBlock;

    Document.find({
        time: { $gt: blockPeriod }
    }).count(function(err, count) {
        log('getting volume since ', new Date(blockPeriod), 'result is', count)
        cb(triggerTime, count);
    });
};

and then I have the second function which I use whenever I want to get a data for my graph (front end):

var getHistory = function(timeBlock, end, cb) {

    Document.aggregate(
    {
        $match: {
            time: {
                $gte: new Date(end - 10 * timeBlock),
                $lt: new Date(end)
            }
        }
    },

    // count number of documents based on time block
    // timeBlock is divided by 1000 as we use it as seconds here
    // and the timeBlock parameter is in miliseconds
    {
        $group: {
            _id: {
                year: { $year: "$time" },
                month: { $month: "$time" },
                day: { $dayOfMonth: "$time" },
                hour: { $hour: "$time" },
                minute: { $minute: "$time" },
                second: { $subtract: [
                    { $second: "$time" },
                    { $mod: [
                        { $second: "$time" },
                        timeBlock / 1000
                    ]}
                ]}
            },
            count: { $sum: 1 }
        }
    },

    // changing the name _id to timeParts
    {
        $project: {
            timeParts: "$_id",
            count: 1,
            _id: 0
        }
    },

    // sorting by date, from earliest to latest
    {
        $sort: {
            "time": 1
        }
    }, function(err, result) {
        if (err) {
            cb(err)
        } else {
            log("start", new Date(end - 10 * timeBlock))
            log("end", new Date(end))
            log("timeBlock", timeBlock)
            log(">****", result)
            cb(result)
        }
    })
}

and the problem is that I can't get the same values on my graph and on the back-end code (getVolume function)

I realised that the log from getHistory is not how I would expect it to be (log below):

start Fri Jul 18 2014 11:56:56 GMT+0100 (BST)
end Fri Jul 18 2014 11:58:36 GMT+0100 (BST)
timeBlock 10000
>**** [ { count: 4,
    timeParts: { year: 2014, month: 7, day: 18, hour: 10, minute: 58, second: 30 } },
  { count: 6,
    timeParts: { year: 2014, month: 7, day: 18, hour: 10, minute: 58, second: 20 } },
  { count: 3,
    timeParts: { year: 2014, month: 7, day: 18, hour: 10, minute: 58, second: 10 } },
  { count: 3,
    timeParts: { year: 2014, month: 7, day: 18, hour: 10, minute: 58, second: 0 } },
  { count: 2,
    timeParts: { year: 2014, month: 7, day: 18, hour: 10, minute: 57, second: 50 } } ]

So I would expect that the getHistory should look up data in mongo every 10 seconds starting from start Fri Jul 18 2014 11:56:56 GMT+0100 (BST) so it will look roughly like:

11:56:56 count: 3
11:57:06 count: 0
11:57:16 count: 14
... etc.

TODO: 1. I know I should cover in my aggregate function the case when the count is 0 at the moment I guess this time is skipped`

Upvotes: 3

Views: 1023

Answers (1)

Leonid Beschastny
Leonid Beschastny

Reputation: 51480

Your error is how you're calculating _id for $group operator, specifically its second part:

second: { $subtract: [
    { $second: "$time" },
    { $mod: [
        { $second: "$time" },
        timeBlock / 1000
    ]}
]}

So, instead of splitting all your data into 10 timeBlock milliseconds long chunks starting from new Date(end - 10 * timeBlock), you're splitting it into 11 chunks starting from from the nearest divisor of timeBlock.

To fix it you should first calculate delta = end - $time and then use it instead of the original $time to build your _id.

Here is an example of what I mean:

Document.aggregate({
    $match: {
        time: {
            $gte: new Date(end - 10 * timeBlock),
            $lt: new Date(end)
        }
    }
}, {
    $project: {
        time: 1,
        delta: { $subtract: [
            new Date(end),
            "$time"
        ]}
    }
}, {
    $project: {
        time: 1,
        delta: { $subtract: [
            "$delta",
            { $mod: [
                "$delta",
                timeBlock
            ]}
        ]}
    }
}, {
    $group: {
        _id: { $subtract: [
            new Date(end),
            "$delta"
        ]},
        count: { $sum: 1 }
    }
}, {
    $project: {
        time: "$_id",
        count: 1,
        _id: 0
    }
}, {
    $sort: {
        time: 1
    }
}, function(err, result) {
    // ...
})

I also recommend you to use raw time values (in milliseconds), because it's much easier and because it'll keep you from making a mistake. You could cast time into timeParts after $group using $project operator.

Upvotes: 2

Related Questions