Lee Theobald

Reputation: 8587

Aggregation: Counting Items Per Hour

I was wondering if someone could help me get my aggregation function right. I'm trying to count the number of times a piece of text appears per hour in a specified day. So far I've got:

db.daily_data.aggregate(
  [
    { $project : { useragent: 1, datetime: 1, url: 1, hour: {$hour: new Date("$datetime")} } },
    { $match : { datetime: {$gte: 1361318400000, $lt: 1361404800000}, useragent: /.*LinkCheck by Siteimprove.*/i } },
    { $group : { _id : { useragent: "$useragent", hour: "$hour" }, queriesPerUseragent: {$sum: 1} } }
  ]
);

But I'm obviously getting it wrong as hour is always 0:

{
  "result" : [
    {
      "_id" : {
        "useragent" : "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.0) LinkCheck by Siteimprove.com",
        "hour" : 0
      },
      "queriesPerUseragent" : 94215
    }
  ],
  "ok" : 1
}

Here's a trimmed down example of a record too:

{
  "_id" : ObjectId("50fe63c70266a712e8663725"),
  "useragent" : "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.0) LinkCheck by Siteimprove.com",
  "datetime" : NumberLong("1358848954813"),
  "url" : "http://www.somewhere.com"
}

I've also tried using new Date("$datetime").getHours() instead of the $hour function to try and get the same result but with no luck. Can someone point me in the direction of where I'm going wrong?

Thanks!

Upvotes: 3

Views: 6155

Answers (1)

Samuel García

Reputation: 2225

This is a recommendation rather than an answer to your problem.

For analytics on MongoDB, it's recommended to pre-aggregate your data into buckets (hourly buckets in your use case) for every metric you want to calculate.

So, for your metric, you can update a pre-aggregated collection each time a request is recorded, which speeds up your queries later:

db.user_agent_hourly.update({url: "your_url", useragent: "your user agent", hour: current_HOUR_of_DAY, date: current_DAY_Date}, {$inc: {count: 1}}, {upsert: true})

Note that current_DAY_Date must be a stable date value for the current day, i.e. current_year/current_month/current_day 00:00:00, so that every metric received on the same day is stored with the same hour:minute:second.
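As a rough illustration of that stable date value, here is a minimal sketch in plain JavaScript. The helper name hourlyBucket is hypothetical, and it assumes that datetime holds epoch milliseconds (as in the sample record in the question) and that buckets are kept in UTC:

```javascript
// Hypothetical helper: derive the hourly bucket for one raw record.
// Assumes `datetime` is epoch milliseconds and that buckets are kept in UTC.
function hourlyBucket(record) {
  const d = new Date(Number(record.datetime));
  return {
    url: record.url,
    useragent: record.useragent,
    // Hour of day, 0-23, in UTC.
    hour: d.getUTCHours(),
    // Midnight (00:00:00 UTC) of the same day: identical for every record of that day.
    date: new Date(Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate()))
  };
}

// Values taken from the sample record in the question.
const bucket = hourlyBucket({
  useragent: "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.0) LinkCheck by Siteimprove.com",
  datetime: 1358848954813,
  url: "http://www.somewhere.com"
});

console.log(bucket.hour, bucket.date.toISOString());
```

The resulting object has the same four fields as the query part of the upsert above, so it could be passed as that first argument.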

Then, you can query this collection, extracting aggregated analytics for any given period of time as follows:

 db.user_agent_hourly.aggregate(
    {$match:{date:{$gte: INITIAL_DATE, $lt: FINAL_DATE}}},
    {$group:{ _id : { useragent: "$useragent", hour: "$hour" } ,queriesPerUseragent: {$sum: "$count"} } },
    {$sort:{queriesPerUseragent:-1}}
 )

If you want to filter the results by a specific user agent, you can use the following query:

 db.user_agent_hourly.aggregate(
    {$match:{date:{$gte: INITIAL_DATE, $lt: FINAL_DATE}, useragent: "your_user_agent"}},
    {$group:{ _id : { useragent: "$useragent", hour: "$hour" }, queriesPerUseragent: {$sum: "$count"} } }
 )

PS: We also store every single received metric in another collection, so that we can reprocess it in case of disaster or for other needs.
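Such a reprocessing step could look roughly like the sketch below. It is plain JavaScript working on an in-memory array rather than on a real collection. The function name rebuildHourlyCounts and the shape of the raw records are assumptions based on the sample record in the question, the count field matches the "$count" used in the aggregation queries above, and buckets are assumed to be in UTC:

```javascript
// Hypothetical reprocessing step: rebuild hourly counters from raw records.
// Assumes each raw record has `url`, `useragent` and `datetime` (epoch ms, non-negative).
const MS_PER_HOUR = 3600 * 1000;
const MS_PER_DAY = 24 * MS_PER_HOUR;

function rebuildHourlyCounts(rawRecords) {
  const buckets = new Map();
  for (const r of rawRecords) {
    const ms = Number(r.datetime);
    const dayStart = ms - (ms % MS_PER_DAY);                // 00:00:00 UTC of that day
    const hour = Math.floor((ms - dayStart) / MS_PER_HOUR); // 0-23
    const key = JSON.stringify([r.url, r.useragent, dayStart, hour]);
    const doc = buckets.get(key) ||
      { url: r.url, useragent: r.useragent, date: new Date(dayStart), hour: hour, count: 0 };
    doc.count += 1;
    buckets.set(key, doc);
  }
  // One document per (url, useragent, day, hour), in first-seen order.
  return Array.from(buckets.values());
}

const rebuilt = rebuildHourlyCounts([
  { useragent: "ua", url: "http://www.somewhere.com", datetime: 1358848954813 },
  { useragent: "ua", url: "http://www.somewhere.com", datetime: 1358850000000 },
  { useragent: "ua", url: "http://www.somewhere.com", datetime: 1358853000000 }
]);

console.log(rebuilt);
```

Each returned document has the same fields as the pre-aggregated collection, so the rebuilt counters could replace a lost or corrupted hourly collection.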

Upvotes: 3
