Reputation: 305
I want to process the logs from my web server as they come in, using Hadoop (Amazon Elastic MapReduce). I googled for help but found nothing useful. I would like to know whether this can be done, or whether there is an alternative way to do it.
Upvotes: 2
Views: 746
Reputation: 9571
If you want true real-time processing, you might want to look at Twitter's Storm, which is open-source and hosted on GitHub. Tutorial here.
It looks like it is being used in production at large companies.
On that note, I don't use Storm myself; I actually do something similar to what has been mentioned in the question and the other responses:
With Hadoop, you can get close to real time by running the batch processing frequently on a cluster, submitting a new job for each batch as it arrives, but that is not true real-time processing. For that you need Storm or something similar.
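To make the "run the batch often" idea concrete, here is a minimal sketch in Python. The per-URL hit count stands in for whatever your real MapReduce job computes, and the `<ip> <url>` log format is an assumption for illustration; results lag the log by at most one polling interval.

```python
import time
from collections import Counter
from pathlib import Path

def run_batch(log_path: Path, offset: int, counts: Counter) -> int:
    """One 'batch job': read only the lines appended since the last run
    and update per-URL hit counts (a stand-in for a real MapReduce job).
    Returns the new byte offset to resume from next time."""
    with log_path.open() as f:
        f.seek(offset)
        for line in f:
            parts = line.split()
            if len(parts) >= 2:       # assumed log format: "<ip> <url>"
                counts[parts[1]] += 1
        return f.tell()

def near_real_time(log_path: Path, interval: float = 60.0, runs: int = 3) -> Counter:
    """Re-run the batch job every `interval` seconds. This is the
    'close to real time' pattern: frequent small batches, not streaming."""
    counts: Counter = Counter()
    offset = 0
    for _ in range(runs):
        offset = run_batch(log_path, offset, counts)
        time.sleep(interval)
    return counts
```

The key point is that each run only processes the delta since the previous run, so shrinking the interval shrinks the lag, but you never get per-event latency the way Storm does.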
Upvotes: 0
Reputation: 8685
Something you can try is to use Flume as a log collector and store the logs in S3 for batch processing:
http://www.cloudera.com/blog/2011/02/distributed-flume-setup-with-an-s3-sink/
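The linked post covers the older Flume (OG) setup; as a rough illustration of the same architecture, here is a hedged agent config in the newer Flume NG properties syntax. The source command, bucket name, and paths are placeholders, and the HDFS sink is pointed at an `s3n://` URI to get the S3 storage the answer describes:

```
# Hypothetical Flume NG agent: tail the access log, buffer in memory,
# roll files into S3 hourly for later batch processing.
agent.sources = tail
agent.channels = mem
agent.sinks = s3

agent.sources.tail.type = exec
agent.sources.tail.command = tail -F /var/log/httpd/access_log
agent.sources.tail.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 10000

agent.sinks.s3.type = hdfs
agent.sinks.s3.channel = mem
agent.sinks.s3.hdfs.path = s3n://my-log-bucket/logs/%Y-%m-%d/
agent.sinks.s3.hdfs.rollInterval = 3600
```

Once the files land in S3, an Elastic MapReduce job can read them directly as input.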
Upvotes: 0
Reputation: 2830
Hadoop is not used for live real-time processing, but it can process logs on an hourly basis, running perhaps one hour behind, which is near real time. I wonder what the need is for processing logs as they come in.
Upvotes: 1
Reputation: 1529
Hadoop is usually used in an offline manner, so I would rather process the logs periodically.
In a project I was involved with previously, we made our servers produce log files that were rotated hourly (every hour at x:00). We had a script that ran hourly (every hour at x:30) and uploaded into HDFS any files that weren't already there. You can then run jobs in Hadoop as often as you like to process these files.
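A minimal sketch of that upload script, with assumptions: the rotated files match `access_log.*`, the directory names are hypothetical, and `shutil.copy2` stands in for the real upload (in practice you would shell out to `hadoop fs -put` or use an HDFS client library). The "skip files already uploaded" check is what makes the half-past-the-hour run idempotent:

```python
import shutil
from pathlib import Path

def upload_new_logs(local_dir: Path, hdfs_dir: Path) -> list:
    """Copy hourly-rotated log files that aren't in the target yet.
    `shutil.copy2` is a local stand-in for `hadoop fs -put`."""
    uploaded = []
    hdfs_dir.mkdir(parents=True, exist_ok=True)
    for log in sorted(local_dir.glob("access_log.*")):
        target = hdfs_dir / log.name
        if not target.exists():          # skip files already uploaded
            shutil.copy2(log, target)
            uploaded.append(log.name)
    return uploaded
```

Running it from cron at x:30 means each rotated file is picked up exactly once, and a missed run is harmlessly caught up by the next one.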
I am sure there are better real-time alternatives too.
Upvotes: 1