user144390

Reputation: 305

Processing live feed of logs from web server using Hadoop

I want to process the logs from my web server as they come in, using Hadoop (Amazon Elastic MapReduce). I googled for help but found nothing useful. I would like to know whether this can be done, or whether there is an alternative way to do it.

Upvotes: 2

Views: 746

Answers (4)

Suman

Reputation: 9571

If you want true real-time processing, you might want to look at Twitter's Storm, which is open-source and hosted on GitHub. Tutorial here.

It looks like it is being used in production at large companies.

On that note, I don't use Storm myself, and actually do something similar to what has been mentioned in the question and the other responses:

  1. Log events using Apache (using rotatelogs to switch log files every 15 or 30 minutes)
  2. Upload them every so often to S3
  3. Add a new step to an existing Hadoop cluster (on Amazon EMR)

With Hadoop, you can get close to real-time by running the batch processing frequently on an existing cluster, adding a new job each time, but you cannot get true real-time. For that you need Storm or something similar.
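The three steps above can be sketched roughly as follows. This is only an illustration, not my actual setup: it assumes boto3 is installed, that rotatelogs writes files named `access.log.<suffix>`, and that the bucket layout, the `mapper.py`/`reducer.py` scripts and the step name are all made-up placeholders. It also re-uploads every finished file on each run, so a real script would need to move or track files it has already sent.

```python
import time
from pathlib import Path


def finished_logs(log_dir, min_age_seconds, now=None):
    """Return rotated log files that have not been modified for min_age_seconds."""
    now = time.time() if now is None else now
    return sorted(
        p for p in Path(log_dir).glob("access.log.*")  # assumed rotatelogs naming
        if now - p.stat().st_mtime >= min_age_seconds
    )


def build_step(input_uri, output_uri):
    """Describe one Hadoop streaming step in the shape EMR's AddJobFlowSteps expects."""
    return {
        "Name": "process-logs",  # hypothetical step name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hadoop-streaming",
                # Hypothetical script locations; replace with your own.
                "-files", "s3://my-log-bucket/code/mapper.py,s3://my-log-bucket/code/reducer.py",
                "-mapper", "mapper.py",
                "-reducer", "reducer.py",
                "-input", input_uri,
                "-output", output_uri,
            ],
        },
    }


def upload_and_submit(log_dir, bucket, cluster_id, batch_id):
    """Upload finished logs to S3, then add a step to an already running cluster."""
    import boto3  # imported here so the helpers above work without boto3 installed

    s3 = boto3.client("s3")
    emr = boto3.client("emr")
    prefix = f"logs/{batch_id}/"
    files = finished_logs(log_dir, min_age_seconds=15 * 60)
    for path in files:
        s3.upload_file(str(path), bucket, prefix + path.name)
    if files:
        step = build_step(f"s3://{bucket}/{prefix}", f"s3://{bucket}/output/{batch_id}/")
        emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
```

Running `upload_and_submit` from cron at the same interval as the log rotation gives the "batch, but often" behaviour described above; the delay is roughly the rotation interval plus the job's run time.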

Upvotes: 0

Andrei Savu

Reputation: 8685

Something you can try is to use Flume as a log collector and store the logs in S3 for batch processing:

http://www.cloudera.com/blog/2011/02/distributed-flume-setup-with-an-s3-sink/

Upvotes: 0

Harsha Hulageri

Reputation: 2830

Hadoop is not used for live, real-time processing. But it can be used to process logs on an hourly basis, so the results are perhaps one hour behind, which is near real-time. I wonder why you need to process the logs as they come in.

Upvotes: 1

mojbro

Reputation: 1529

Hadoop is usually used in an offline manner. So I would rather process the logs periodically.

In a project I was involved with previously, we made our servers produce log files that were rotated hourly (every hour at x:00). We had a script that ran hourly (every hour at x:30) and uploaded the files into HDFS (those that weren't already there). Then you can run jobs as often as you like in Hadoop to process these files.
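A minimal sketch of what such an upload script could look like (not the one we actually used). It assumes the `hdfs` command-line client is on the PATH, that `local_dir` contains only rotated, already closed `*.log` files, and the directory names are placeholders. `hdfs dfs -test -e` exits with 0 when the path exists, which is how files already in HDFS are skipped.

```python
import subprocess
from pathlib import Path


def upload_missing(local_dir, hdfs_dir, run=subprocess.run):
    """Copy each rotated log file into HDFS unless a file of that name is already there.

    `run` is injectable so the logic can be exercised without a Hadoop install.
    """
    uploaded = []
    for path in sorted(Path(local_dir).glob("*.log")):
        target = f"{hdfs_dir}/{path.name}"
        # Exit code 0 means the target already exists in HDFS.
        exists = run(["hdfs", "dfs", "-test", "-e", target]).returncode == 0
        if not exists:
            run(["hdfs", "dfs", "-put", str(path), target], check=True)
            uploaded.append(path.name)
    return uploaded
```

Scheduled from cron at x:30, a call such as `upload_missing("/var/log/web/rotated", "/logs/web")` would pick up the file that was closed at x:00, and it is safe to re-run because existing files are left alone.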

I am sure there are better real-time alternatives too.

Upvotes: 1
