Reputation: 493
I have to analyze Gzip-compressed log files, stored on a production server, using Hadoop-related tools.
I can't decide how to do that or what to use; here are some of the methods I thought about using (feel free to recommend something else):
Before I can do anything, I need to get the compressed files from the production server, process them, and then push them into Apache HBase.
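For illustration, here is a minimal sketch of the last step only, i.e. pushing raw gzip log lines into HBase with the Java client. The access_logs table, the d column family, and the row-key scheme are made-up assumptions, not a recommendation:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.util.zip.GZIPInputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class GzipLogToHBase {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("access_logs"));   // hypothetical table
                 BufferedReader reader = new BufferedReader(new InputStreamReader(
                         new GZIPInputStream(new FileInputStream(args[0]))))) {
                String line;
                long lineNo = 0;
                while ((line = reader.readLine()) != null) {
                    // row key = file name + line number; a real schema needs more thought
                    Put put = new Put(Bytes.toBytes(args[0] + ":" + lineNo++));
                    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("raw"), Bytes.toBytes(line));
                    table.put(put);
                }
            }
        }
    }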
Upvotes: 3
Views: 3571
Reputation: 493
As I have the log files stored on a production server, I am going to copy them into HDFS, and I have written a MapReduce program to process them.
I think @Marko Bonaci's answer is also valid; we can try Spark to analyze the log files.
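A rough sketch of what such a MapReduce job could look like, assuming a hypothetical log format where the third whitespace-separated field is the log level (Hadoop's TextInputFormat reads .gz input transparently, but gzip is not splittable, so each file goes to a single mapper):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LogLevelCount {

        // Emits (level, 1) for each line, e.g. "2015-06-01 12:00:00 ERROR ..."
        public static class LevelMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text level = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                String[] fields = value.toString().split(" ");
                if (fields.length > 2) {       // assumed layout: date, time, level, message
                    level.set(fields[2]);
                    ctx.write(level, ONE);
                }
            }
        }

        // Sums the counts per log level
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "log level count");
            job.setJarByClass(LogLevelCount.class);
            job.setMapperClass(LevelMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS dir with the .gz logs
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }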
Thanks all for your valuable input.
Upvotes: 0
Reputation: 5716
Depending on the size of your logs (assuming the computation won't fit on a single machine, i.e. it requires a "big data" product), I think it might be most appropriate to go with Apache Spark. Given that you don't know much about the ecosystem, it might be best to go with Databricks Cloud, which will give you a straightforward way of reading your logs from HDFS and analyzing them with Spark transformations in a visual way (with a notebook).
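As a rough illustration (not Databricks-specific), a minimal Spark job over the compressed logs could look like the sketch below; the HDFS path and the "ERROR" filter are placeholders, and textFile() handles .gz input transparently:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class GzipLogAnalysis {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("gzip log analysis");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // textFile() decompresses .gz input transparently
            JavaRDD<String> lines = sc.textFile("hdfs:///logs/*.gz");   // placeholder path

            // trivial analysis: count the lines that mention ERROR
            long errors = lines.filter(line -> line.contains("ERROR")).count();
            System.out.println("ERROR lines: " + errors);

            sc.stop();
        }
    }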
There's a video at the link above.
There's a free trial so you can see how that would go and then decide.
PS I'm in no way affiliated with Databricks. Just think they have a great product, that's all :)
Upvotes: 5
Reputation: 38950
You have mixed several inter-related concepts that are not alternatives to each other.
Have a look at the Hadoop ecosystem.
Apache MapReduce is a YARN (Yet Another Resource Negotiator) based system for parallel processing of large data sets. It provides a simple programming API.
Apache Kafka is a distributed publish-subscribe system for processing large amounts of streaming data. You can treat Kafka as a simple "message store".
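As an illustration of what "publish-subscribe" means in practice, a minimal producer that ships log lines to a Kafka topic might look like this sketch (the broker address and the weblogs topic are made up):

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class LogLineProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // made-up broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // each log line becomes one message on the (made-up) "weblogs" topic
                producer.send(new ProducerRecord<>("weblogs", "2015-06-01 12:00:00 ERROR something failed"));
            }
        }
    }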
Apache Flume is specially designed for the collection, aggregation, and movement of large amounts of log data (in unstructured format) into HDFS. It collects data from various HTTP sources and web servers.
Once the data has been imported by Flume into HDFS, it can be converted into structured data with Pig or Hive, and reports can be generated in structured form. Pig and Hive run a series of MapReduce jobs to process the data and generate the reports.
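For example, once the log data is exposed as a Hive table, a simple report query can be run over HiveServer2's JDBC interface; a rough sketch, where the host, credentials, and the weblogs table are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class LogReport {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
            try (Connection con = DriverManager.getConnection(
                         "jdbc:hive2://namenode:10000/default", "hdfs", "");   // made-up host/user
                 Statement stmt = con.createStatement();
                 ResultSet rs = stmt.executeQuery(
                         "SELECT level, COUNT(*) AS hits FROM weblogs GROUP BY level")) {
                while (rs.next()) {
                    System.out.println(rs.getString("level") + "\t" + rs.getLong("hits"));
                }
            }
        }
    }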
Have a look at this article to get a better understanding of a log-file processing architecture.
Upvotes: 1
Reputation: 6181
Each of the tools you mention does something different:
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. MapReduce is more of a design pattern for processing data.
My suggestion is to define more precisely what you are really looking for, and then examine the relevant tools.
Upvotes: 0