Reputation: 53

Developing web analytics with Hadoop

I want to develop a web analytics platform that produces aggregated data about web traffic (page views, visits, visitors, etc.) by parsing Apache access logs.

Can I do it only with Hadoop and pure Map/Reduce jobs?

Is it overkill, or a “must”, to use Hive?

Upvotes: 2

Answers (4)

Renata Ghisloti

Reputation: 557

If you decide to use Hadoop with Hive or Pig to solve your problem, it might save some time to download Cloudera's or IBM's Hadoop distribution. They already come with the whole Hadoop framework, including Pig and Hive, and usually provide a step-by-step web-interface installation process.

Their basic versions are free:

http://www-01.ibm.com/software/data/infosphere/biginsights/
http://www.cloudera.com/content/support/en/downloads.html

If you don't want to lose much time on the framework itself, it might be a good solution. Hope it helps!

Upvotes: 0

Joel

Reputation: 21

Check out Datameer. They have a bunch of pre-packaged functions for clickstream analysis built on top of Hadoop. They also support Google Analytics if you are already using that tool.

Upvotes: 2

David Gruzman

Reputation: 8088

I think Hive is the most suitable platform for this kind of task, since most of the aggregations map naturally to SQL GROUP BY queries.
You might need to extend Hive with two things:
a) A SerDe to read your log format.
b) An IP2Country UDF (user-defined function) to group your logs by country.

I do not think it makes much sense to write vanilla MR jobs for this task. As a rule of thumb, tasks that are usually solved with an RDBMS should first be tried with Hive.
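
To make the SerDe and GROUP BY ideas concrete, here is a minimal sketch in plain Python (not Hive code). It assumes the logs are in the Apache Common Log Format; the regular expression, the function names, and the sample lines are all made up for illustration, so adjust them to your own LogFormat.

```python
import re
from collections import Counter

# Assumed Apache Common Log Format; change the pattern if your LogFormat differs.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_line(line):
    """Turn one log line into named fields (roughly the job of a SerDe)."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def page_views_by_path(lines):
    """Rough equivalent of: SELECT path, COUNT(*) ... GROUP BY path."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[record["path"]] += 1
    return counts

def visitors_by_day(lines):
    """Rough equivalent of: SELECT day, COUNT(DISTINCT ip) ... GROUP BY day."""
    ips_per_day = {}
    for line in lines:
        record = parse_line(line)
        if record is not None:
            day = record["time"].split(":", 1)[0]  # date part before the first colon
            ips_per_day.setdefault(day, set()).add(record["ip"])
    return {day: len(ips) for day, ips in ips_per_day.items()}

# Made-up sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [10/Oct/2012:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.1 - - [10/Oct/2012:13:56:01 +0000] "GET /about.html HTTP/1.1" 200 1024',
    '192.0.2.7 - - [10/Oct/2012:14:02:11 +0000] "GET /index.html HTTP/1.1" 200 2326',
    'not a log line',
]

if __name__ == "__main__":
    print(page_views_by_path(SAMPLE))
    print(visitors_by_day(SAMPLE))
```

Counting distinct IPs is only a crude stand-in for "visitors"; a country breakdown would additionally need an IP-to-country lookup, which is what the UDF in (b) would provide.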

Upvotes: 1

Praveen Sripati

Reputation: 33535

Hive and Pig are layers of abstraction over Hadoop MapReduce that make creating and running MR jobs easier. Pig and Hive scripts are easy to write and are automatically converted into MR jobs.

As with any layer of abstraction, Pig and Hive scripts take considerably less time to write than an MR job in Java, but add some runtime overhead. As Pig and Hive mature, this gap will narrow.

Kevin quantified his experience: he found that a Pig script is typically 5% of the code of a native map/reduce job and is written in about 5% of the time. However, queries typically take 110-150% of the time a native map/reduce job would have taken.

To summarize, Hive is not a must, but it makes it easier for the end user to create and run MR jobs, at the cost of a bit of overhead.
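
For a feel of what the hand-written route involves, here is a minimal sketch of the map, shuffle/sort, and reduce steps for a page-view count. It is plain Python that simulates the phases in memory, not the Hadoop Java API; the function names and sample lines are hypothetical, and the parsing assumes the request is the first double-quoted field of each log line.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit (path, 1) for every line with a parsable request field."""
    for line in lines:
        parts = line.split('"')
        if len(parts) < 3:
            continue  # no quoted request field, skip the line
        request = parts[1].split()  # e.g. ['GET', '/index.html', 'HTTP/1.1']
        if len(request) >= 2:
            yield request[1], 1

def reducer(sorted_pairs):
    """Reduce phase: sum the counts per key; input must be sorted by key."""
    for path, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield path, sum(count for _, count in group)

def run_job(lines):
    """Simulate the shuffle/sort that Hadoop performs between map and reduce."""
    return dict(reducer(sorted(mapper(lines))))

# Made-up sample lines, for illustration only.
SAMPLE = [
    '192.0.2.1 - - [10/Oct/2012:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.1 - - [10/Oct/2012:13:56:01 +0000] "GET /about.html HTTP/1.1" 200 1024',
    '192.0.2.7 - - [10/Oct/2012:14:02:11 +0000] "GET /index.html HTTP/1.1" 200 2326',
    'garbage without a request',
]

if __name__ == "__main__":
    print(run_job(SAMPLE))
```

In Hive the same aggregation would be a single GROUP BY query over a table defined on the logs, which is where the saving in code size comes from.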

Upvotes: 4
