Reputation: 3752
I have a requirement to process log file data. It is relatively trivial. I have 4 servers with 2 web applications running on each, for a total of 8 log files. These get rotated on a regular basis. I write data into these log files in the following format:
Source Timestamp :9340398;39048039;930483;3940830
Where the numbers are identifiers in a data store. I want to set up a process to read these logs and for each id it will update a count depending on the number of times its id has been logged. It can either be real time or batch. My interface language to the datastore is Java. The process runs in production so needs to be robust but also needs to have a relatively simple architecture so it is maintainable. We also run zookeeper.
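For reference, the parsing/counting step I have in mind per rotated file is roughly the following (a minimal sketch; the write-back to the datastore is omitted since that interface is ours):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.stream.Stream;

    // Minimal sketch: read one rotated log file and tally how often each id
    // appears. Lines look like "Source Timestamp :9340398;39048039;930483".
    public class LogCounter {

        public static Map<String, Long> countIds(Path logFile) throws IOException {
            Map<String, Long> counts = new HashMap<>();
            try (Stream<String> lines = Files.lines(logFile)) {
                lines.forEach(line -> {
                    // ids follow the last ':' (lastIndexOf, in case the
                    // timestamp itself contains colons) and are ';'-separated
                    int colon = line.lastIndexOf(':');
                    if (colon < 0) return; // skip malformed lines
                    for (String id : line.substring(colon + 1).split(";")) {
                        id = id.trim();
                        if (!id.isEmpty()) {
                            counts.merge(id, 1L, Long::sum);
                        }
                    }
                });
            }
            return counts;
        }
    }

Each resulting (id, count) pair would then be applied to the datastore as a single update per id rather than one update per log line.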
My initial thought was to do this as a batch whenever a log file is rotated, running an Apache Spark job on each server. However, I then started looking at log aggregators such as Apache Flume, Kafka and Storm, but this seems like overkill.
Given the multitude of choices, has anyone got any good suggestions, based on experience, as to which tools to use for this problem?
Upvotes: 0
Views: 750
Reputation: 25929
8 log files don't seem to warrant any "big data" technology. If you do want to play with / get started with this type of technology, I'd recommend you start with Spark and/or Flink - both have relatively similar programming models and both can handle "business real-time" processing (Flink is better at streaming, but either would seem to work in your case). Storm is relatively rigid (topologies are hard to change) and has a more complex programming model.
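For a sense of scale, the whole job in Spark's Java API is only a few lines. A rough sketch (the input path is assumed to arrive as the first program argument, and the println stands in for whatever datastore update you do):

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    // Rough sketch of the counting job against the log format in the question.
    public class IdCountJob {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("id-count");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
                JavaPairRDD<String, Long> counts = sc.textFile(args[0])
                        .filter(line -> line.contains(":"))
                        // ids sit after the last ':' and are ';'-separated
                        .flatMap(line -> Arrays.asList(
                                line.substring(line.lastIndexOf(':') + 1).split(";")).iterator())
                        .mapToPair(id -> new Tuple2<>(id.trim(), 1L))
                        .reduceByKey(Long::sum);
                // runs on the executors; this is where the datastore update would go
                counts.foreach(pair -> System.out.println(pair._1 + " -> " + pair._2));
            }
        }
    }

The same split/key-by/sum structure carries over almost one-to-one to Flink's API if you go that route instead.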
Upvotes: 1