Reputation: 15475
hey all, just getting started on Hadoop and curious what the best way in MapReduce would be to count unique visitors if your log files looked like this...
DATE siteID action username
05-05-2010 siteA pageview jim
05-05-2010 siteB pageview tom
05-05-2010 siteA pageview jim
05-05-2010 siteB pageview bob
05-05-2010 siteA pageview mike
and you wanted to find the number of unique visitors for each site?
I was thinking the mapper would emit siteID \t username, and the reducer would keep a set() of the unique usernames per key and then emit the length of that set. However, that could mean holding millions of usernames in memory, which doesn't seem right. Anyone have a better way?
I'm using Python streaming, by the way.
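To make that concrete, here's the rough sketch I had in mind (file names and field handling are just illustrative):

mapper.py:

    #!/usr/bin/env python
    # mapper: emit "siteID TAB username" for each pageview line
    import sys

    for line in sys.stdin:
        fields = line.split()
        if len(fields) == 4:
            date, site, action, user = fields
            print('%s\t%s' % (site, user))

reducer.py:

    #!/usr/bin/env python
    # reducer: streaming delivers keys grouped and sorted, so only one
    # site's set is live at a time -- but that set could still hold
    # millions of usernames for a popular site, which is my worry
    import sys

    current_site = None
    users = set()
    for line in sys.stdin:
        site, user = line.rstrip('\n').split('\t')
        if site != current_site:
            if current_site is not None:
                print('%s\t%d' % (current_site, len(users)))
            current_site, users = site, set()
        users.add(user)
    if current_site is not None:
        print('%s\t%d' % (current_site, len(users)))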
thanks
Upvotes: 9
Views: 6277
Reputation: 19666
Use a secondary sort to sort on the user id. That way you don't need to keep anything in memory -- just stream the data through, and increment your distinct counter every time you see the value change for a particular site id.
Here is some documentation.
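Not from the linked docs, but a minimal sketch of that reducer in Python streaming, assuming the job is configured so map output of the form siteID \t username is partitioned on siteID only but sorted on both fields (e.g. KeyFieldBasedPartitioner with stream.num.map.output.key.fields=2 and mapred.text.key.partitioner.options=-k1,1):

    #!/usr/bin/env python
    # reducer: lines arrive sorted by (siteID, username); count the number
    # of username changes per site -- constant memory, no sets needed
    import sys

    current_site = None
    current_user = None
    count = 0
    for line in sys.stdin:
        site, user = line.rstrip('\n').split('\t')
        if site != current_site:
            if current_site is not None:
                print('%s\t%d' % (current_site, count))
            current_site, current_user, count = site, user, 1
        elif user != current_user:
            current_user = user
            count += 1
    if current_site is not None:
        print('%s\t%d' % (current_site, count))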
Upvotes: 1
Reputation: 26689
It is often faster to use HiveQL to solve simple tasks like this. Hive will translate your queries into Hadoop MapReduce jobs. In this case, to get the count per site, you may use
SELECT siteID, COUNT(DISTINCT username) FROM logviews GROUP BY siteID
You may find a more advanced example here: http://www.dataminelab.com/blog/calculating-unique-visitors-in-hadoop-and-hive/
Upvotes: 0
Reputation: 10642
My approach is similar to what tzaman gave, with a small twist.
Note that the first reducer does not need to go over any of the records it gets presented. You can simply examine the key and produce the output.
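Not the exact code, but that first reducer could be as small as this sketch, assuming the mapper emits the composite key siteID \t username (with stream.num.map.output.key.fields=2 so both fields are treated as the key):

    #!/usr/bin/env python
    # first reducer: duplicate keys arrive adjacent after the shuffle, so
    # emit each distinct "siteID TAB username" key once, ignoring any values
    import sys

    prev = None
    for line in sys.stdin:
        key = line.rstrip()  # strip newline (and trailing tab from an empty value)
        if key != prev:
            print(key)
            prev = key

A second job then only has to count lines per siteID to get the unique-visitor totals.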
HTH
Upvotes: 1
Reputation: 47790
You could do it as a 2-stage operation:
First step: emit (username => siteID), and have the reducer collapse multiple occurrences of siteID using a set - since you'd typically have far fewer sites than users, this should be fine.
Second step: emit (siteID => username) and do a simple count, since the duplicates have been removed.
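A sketch of the first-stage reducer along those lines (Python streaming; the second stage is then just an identity mapper plus a count-per-siteID reducer, essentially a word count):

    #!/usr/bin/env python
    # stage-1 reducer: input "username TAB siteID" grouped by username;
    # collapse each user's sites into a set, then emit the de-duplicated
    # (siteID, username) pairs for the second stage
    import sys

    def flush(user, sites):
        for site in sites:
            print('%s\t%s' % (site, user))

    current_user = None
    sites = set()
    for line in sys.stdin:
        user, site = line.rstrip('\n').split('\t')
        if user != current_user:
            if current_user is not None:
                flush(current_user, sites)
            current_user, sites = user, set()
        sites.add(site)
    if current_user is not None:
        flush(current_user, sites)

The per-user set stays small because any single user visits far fewer sites than a site has users.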
Upvotes: 3