webdevbyjoss

Reputation: 524

Amazon MapReduce best practices for logs analysis

I'm parsing access logs generated by Apache, Nginx, Darwin (video streaming server) and aggregating statistics for each delivered file by date / referrer / useragent.

Tons of logs are generated every hour, and that number is likely to increase dramatically in the near future, so processing that kind of data in a distributed manner via Amazon Elastic MapReduce sounds reasonable.

Right now my mappers and reducers are ready to process my data, and I have tested the whole process with the following flow:

I've done that manually, following the thousands of tutorials about Amazon EMR that can be found on the Internet.

What should I do next? What is the best approach to automate this process?

I think that this topic can be useful for many people who try to process access logs with Amazon Elastic MapReduce but were not able to find good materials and/or best practices.

UPD: Just to clarify here is the single final question:

What are the best practices for log processing powered by Amazon Elastic MapReduce?

Related posts:

Getting data in and out of Elastic MapReduce HDFS

Upvotes: 10

Views: 2609

Answers (1)

Charles Menguy

Reputation: 41458

That's a very wide-open question, but here are some thoughts you could consider:

  • Using Amazon SQS: this is a distributed queue and is very useful for workflow management. You can have one process that writes to the queue as soon as a log is available, and another that reads from it, processes the log described in the queue message, and deletes the message when it's done processing. This helps ensure that each log is processed only once.
  • Apache Flume, as you mentioned, is very useful for log aggregation. This is something you should consider even if you don't need real-time processing, as it gives you at the very least a standardized aggregation process.
  • Amazon recently released Simple Workflow (SWF). I have just started looking into it, but it sounds promising for managing every step of your data pipeline.
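To make the SQS idea a bit more concrete, here is a minimal sketch of the write / read / process / delete loop. It is only an illustration under some assumptions: `sqs` is assumed to be a boto3 SQS client (or anything exposing the same `send_message`, `receive_message` and `delete_message` calls), and the function names, the JSON message layout (`bucket`, `key`) and the `process_log` callback are hypothetical names made up for this example. Also keep in mind that standard SQS queues deliver at least once, so the processing step should tolerate an occasional duplicate.

```python
import json


def enqueue_log(sqs, queue_url, bucket, key):
    """Announce that a new log file is ready (hypothetical message layout)."""
    body = json.dumps({"bucket": bucket, "key": key})
    sqs.send_message(QueueUrl=queue_url, MessageBody=body)


def drain_queue(sqs, queue_url, process_log):
    """Process queued logs until a receive call returns nothing.

    Returns the number of messages handled. `process_log` is a
    caller-supplied function taking (bucket, key).
    """
    handled = 0
    while True:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # SQS allows at most 10 per call
            WaitTimeSeconds=20,      # long polling, at most 20 seconds
        )
        messages = resp.get("Messages", [])
        if not messages:
            return handled
        for msg in messages:
            task = json.loads(msg["Body"])
            process_log(task["bucket"], task["key"])
            # Delete only after processing succeeded. If process_log raises,
            # the message becomes visible again after the visibility timeout
            # and can be retried by another worker.
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
            )
            handled += 1
```

With boto3 installed, the client would typically come from `boto3.client("sqs")`, and `process_log` could be whatever kicks off your EMR step for that file. Deleting the message last is the design choice that matters here: a crash in the middle leaves the message in the queue instead of silently losing the log.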

Hope that gives you some clues.

Upvotes: 2
