Reputation: 524
I'm parsing access logs generated by Apache, Nginx, and Darwin (a video streaming server), and aggregating statistics for each delivered file by date / referrer / user agent.
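For context, a mapper for this kind of aggregation can be quite small. Below is a minimal sketch of a Hadoop Streaming mapper in Python, assuming the standard Apache/Nginx "combined" log format (Darwin's format would need its own regex); the field names and the composite key layout are my own illustration, not anything prescribed by the question:

```python
import re
import sys

# Apache/Nginx "combined" log format. The group names are illustrative.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<date>[^:]+):[^\]]+\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"'
)

def map_line(line):
    """Turn one access-log line into a tab-separated key/value record,
    or return None if the line does not match the combined format."""
    m = LOG_RE.match(line)
    if not m:
        return None
    request = m.group('request')
    # "GET /video.mp4 HTTP/1.1" -> "/video.mp4"
    path = request.split(' ')[1] if ' ' in request else '-'
    key = '\t'.join((path, m.group('date'),
                     m.group('referrer'), m.group('useragent')))
    return key + '\t1'

if __name__ == '__main__':
    # Hadoop Streaming feeds input lines on stdin and reads
    # tab-separated key/value pairs from stdout.
    for line in sys.stdin:
        out = map_line(line)
        if out is not None:
            print(out)
```

A matching reducer would then sum the trailing `1` counts for each identical key, which is exactly the word-count pattern most EMR streaming tutorials start from.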
Tons of logs are generated every hour, and that volume is likely to increase dramatically in the near future, so processing this kind of data in a distributed manner via Amazon Elastic MapReduce sounds reasonable.
Right now I'm ready with mappers and reducers to process my data and tested the whole process with the following flow:
I've done all of that manually, following the many tutorials about Amazon EMR that can be found online.
What should I do next? What is the best approach to automating this process?
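One common automation route is to launch a transient EMR cluster programmatically with the AWS SDK, run the streaming steps, and let the cluster terminate itself. Here is a hedged sketch using boto3's EMR client; every name in it (bucket paths, cluster sizing, the `launch` helper itself) is a placeholder I made up, not something from the question:

```python
def build_streaming_step(name, mapper, reducer, input_uri, output_uri):
    """Build one Hadoop Streaming step definition in the shape
    expected by EMR's RunJobFlow API."""
    return {
        'Name': name,
        'ActionOnFailure': 'TERMINATE_CLUSTER',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': [
                'hadoop-streaming',
                '-mapper', mapper,
                '-reducer', reducer,
                '-input', input_uri,
                '-output', output_uri,
            ],
        },
    }

def launch(steps):
    """Launch a transient cluster that runs the steps and shuts down.
    Requires AWS credentials; instance types and roles are assumptions."""
    import boto3  # imported here so the step builder has no AWS dependency
    emr = boto3.client('emr')
    return emr.run_job_flow(
        Name='access-log-aggregation',
        ReleaseLabel='emr-6.15.0',
        Instances={
            'MasterInstanceType': 'm5.xlarge',
            'SlaveInstanceType': 'm5.xlarge',
            'InstanceCount': 3,
            # Transient cluster: terminate once all steps finish.
            'KeepJobFlowAliveWhenNoSteps': False,
        },
        Steps=steps,
        JobFlowRole='EMR_EC2_DefaultRole',
        ServiceRole='EMR_DefaultRole',
    )
```

Scheduling that launcher from cron (or, nowadays, a workflow tool) per log batch is one way to turn the manual tutorial flow into an unattended pipeline; the right choice depends on how your logs land in S3.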
I think this topic could be useful for many people who are trying to process access logs with Amazon Elastic MapReduce but have not been able to find good materials and/or best practices.
UPD: Just to clarify, here is the single final question:
What are the best practices for log processing powered by Amazon Elastic MapReduce?
Related posts:
Getting data in and out of Elastic MapReduce HDFS
Upvotes: 10
Views: 2609
Reputation: 41458
That's a very, very open-ended question, but here are some thoughts you could consider:
Hope that gives you some clues.
Upvotes: 2