Reputation: 21

MongoDB mapreduce optimization

I have a collection of hits stored in MongoDB with this schema: { userid: ..., date: ... }

I want to display a report that computes the number of unique visitors between two dates (distinct userids with at least one hit between these dates).

Example of output:

Number of visitors: ... Number of hits: ...

The collection's size is about 1M records.

My first idea is to run an incremental mapreduce to compute aggregated values per day, and then a second mapreduce over the days to produce the final result.

The problem is that when I select a range of dates in the report, I'm not able to compute the correct number of unique visitors.

Example of aggregated values by day:

Day 1: 1 unique visitor
Day 2: 2 unique visitors (1 of the 2 visitors also made a hit on day 1)

The sum of unique visitors over the two days is 3, but over the whole period there are only 2 unique visitors, not 3.
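To illustrate the problem, here is a tiny plain-JavaScript sketch. The userids "a" and "b" are made up; they only reproduce the two-day example above.

```javascript
// Hypothetical per-day userids: user "a" visits on both days.
const day1 = new Set(["a"]);
const day2 = new Set(["a", "b"]);

// Summing the per-day unique counts counts the returning visitor twice.
const summedCounts = day1.size + day2.size;

// Taking the union of the userids counts each visitor once for the period.
const uniqueOverPeriod = new Set([...day1, ...day2]).size;

console.log(summedCounts, uniqueOverPeriod);
```

The per-day counts alone do not carry enough information to recover the second number; the userids themselves are needed.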

Is there a performant way to compute unique visitors in this example?

Upvotes: 1

Answers (2)

Asya Kamsky

Reputation: 42352

You can do this easily with MongoDB 2.2 or later, using the aggregation framework.

Assuming the schema {userid: " ", date: " "} and two specific dates d1 and d2, this is the pipeline:

db.collection.aggregate(
[
    {
        "$match" : {
            "date" : {
                "$gte" : d1,
                "$lte" : d2
            }
        }
    },
    {
        "$group" : {
            "_id" : "$userid",
            "hits" : {
                "$sum" : 1
            }
        }
    },
    {
        "$group" : {
            "_id" : "1",
            "visitors" : {
                "$sum" : 1
            },
            "hits" : {
                "$sum" : "$hits"
            }
        }
    },
    {
        "$project" : {
            "_id" : 0,
            "visitors" : 1,
            "hits" : 1
        }
    }
]
)
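To make the stages concrete, here is a minimal in-memory sketch in plain JavaScript of what the pipeline computes. The sample hits and the function name summarizeHits are made up for illustration, and this is not how the server executes the pipeline.

```javascript
// Hypothetical sample data following the schema {userid, date}.
const hits = [
  { userid: "a", date: new Date("2012-09-01") },
  { userid: "a", date: new Date("2012-09-02") },
  { userid: "b", date: new Date("2012-09-02") },
  { userid: "c", date: new Date("2012-09-05") },
];

// Mirrors the pipeline: $match on the date range, $group by userid,
// then a second $group that counts the users and sums their hits.
function summarizeHits(docs, d1, d2) {
  const hitsPerUser = new Map();
  for (const doc of docs) {
    if (doc.date >= d1 && doc.date <= d2) {
      // First $group: one entry per userid with its hit count.
      hitsPerUser.set(doc.userid, (hitsPerUser.get(doc.userid) || 0) + 1);
    }
  }
  let visitors = 0;
  let totalHits = 0;
  for (const count of hitsPerUser.values()) {
    // Second $group: one visitor per entry, hits summed across entries.
    visitors += 1;
    totalHits += count;
  }
  return { visitors: visitors, hits: totalHits };
}

const summary = summarizeHits(hits, new Date("2012-09-01"), new Date("2012-09-02"));
console.log(summary);
```

One difference to keep in mind: when no hit falls in the range, this sketch returns zeros, whereas the pipeline returns no result document at all.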

Upvotes: 0

ACE

Reputation: 484

This problem might be easier to solve by using a single map-reduce over the desired dates. Instead of first aggregating the unique users for a single day (your first step), you could do this same aggregation over all of the dates you wish to check. In this way you can avoid the second step entirely.

To break this down into the Map and Reduce sections:

Map: Find all of the userids that were recorded during the desired time range

Reduce: Remove all duplicated userids

Once this process is complete you should be left with the set of unique visitors (more specifically, unique userids) for that time range.
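As a rough sketch of those two steps: the function names mapHit, reduceHits and runLocally are made up, and runLocally is only a stand-in for the server so the functions can be tried in plain JavaScript. Note that the duplicates are removed by using userid as the key; the reduce step here just adds up the hits per user.

```javascript
// Map: runs once per hit. In the mongo shell `this` is the current document
// and emit() is supplied by the server. Keying on userid collapses duplicates.
function mapHit() {
  emit(this.userid, 1);
}

// Reduce: folds the values emitted for one userid into a single hit count.
function reduceHits(userid, counts) {
  var total = 0;
  for (var i = 0; i < counts.length; i++) {
    total += counts[i];
  }
  return total;
}

// Stand-in for the server (an assumption for local testing only):
// it groups emitted values by key and reduces each group.
function runLocally(docs, map, reduce) {
  var emitted = new Map();
  globalThis.emit = function (key, value) {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  docs.forEach(function (doc) { map.call(doc); });
  var results = [];
  emitted.forEach(function (values, key) {
    results.push({ _id: key, value: reduce(key, values) });
  });
  return results;
}

// Hypothetical hits that are already restricted to the desired date range.
var sampleHits = [{ userid: "a" }, { userid: "a" }, { userid: "b" }];
var perUser = runLocally(sampleHits, mapHit, reduceHits);
console.log(perUser.length); // one entry per distinct userid
```

In the shell, assuming a collection named hits, the same two functions could be passed as db.hits.mapReduce(mapHit, reduceHits, { query: { date: { $gte: d1, $lte: d2 } }, out: { inline: 1 } }). The number of result entries is the number of unique visitors.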

Alternatively, there is an even easier way to do this that does not require map-reduce at all. The "distinct" command (see the MongoDB distinct documentation) lets you select a field and returns an array containing only the distinct (unique) values for that field. If you use the distinct command on the documents within the desired time range, you get an array that contains all the userids from that period without any duplicates.
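A minimal sketch of the distinct approach follows. The helper name countUniqueVisitors and the fakeCollection object are made up; the fake only imitates the shell's distinct(field, query) for a date range so the helper can be tried outside the shell.

```javascript
// Counts unique visitors between d1 and d2 (inclusive).
// `collection` is expected to behave like a mongo shell collection,
// where distinct(field, query) returns an array of distinct values.
function countUniqueVisitors(collection, d1, d2) {
  var query = { date: { $gte: d1, $lte: d2 } };
  return collection.distinct("userid", query).length;
}

// Stand-in collection (an assumption for local testing only).
var fakeCollection = {
  docs: [
    { userid: "a", date: new Date("2012-09-01") },
    { userid: "a", date: new Date("2012-09-02") },
    { userid: "b", date: new Date("2012-09-02") },
    { userid: "c", date: new Date("2012-09-05") },
  ],
  distinct: function (field, query) {
    var range = query.date;
    var values = this.docs
      .filter(function (d) { return d.date >= range.$gte && d.date <= range.$lte; })
      .map(function (d) { return d[field]; });
    return Array.from(new Set(values));
  },
};

var uniqueVisitors = countUniqueVisitors(
  fakeCollection, new Date("2012-09-01"), new Date("2012-09-02"));
console.log(uniqueVisitors);
```

In the shell you would pass the real collection instead, for example countUniqueVisitors(db.hits, d1, d2), where the collection name hits is an assumption. Keep in mind that the whole array of userids is returned to the client and has to fit within the maximum document size, so this suits a moderate number of distinct visitors.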

Hope this helps!

Upvotes: 3
