brent

Reputation: 1105

Apache Spark as a log store

I have a few questions around using Apache Spark to store our application logs (yes, storing the logs in Apache Spark, NOT storing the logs that Apache Spark itself creates).

1) Is storing (and of course analyzing) logs in Apache Spark a good use case for the product? I'm just looking for a "yes, depending on what you mean by good" - or a "no, it's unlikely to be a good fit for classic log storage / analysis; use ElasticSearch for that".

2) What would be the best way to write new logs from our application to the Spark cluster? https://spark.apache.org/docs/0.9.0/streaming-programming-guide.html mentions "Data can be ingested from ... plain old TCP sockets", but I haven't been able to find a guide on how to open, and ingest data from, a TCP socket.

3) If we use logback within our application, what would be the correct appender to define to save the logs to the Spark cluster?

I realize these questions are quite high level, so I'm just looking for guidance on whether I'm on the right track, and perhaps some links to articles that can help me further my understanding - not a detailed implementation of these rather big questions!

Thanks

Upvotes: 0

Views: 558

Answers (1)

samthebest

Reputation: 31513

Yes, Spark can work very well for log mining.

  1. It depends on what your analysis will be - if you're only going to do lookups and greps then possibly ElasticSearch could fit too, but the second you wish to do something more complicated then Spark will be better. The nice thing about Spark is its flexibility.

  2. Depends on your analysis again, and on when you want that analysis. If you want a real-time dashboard, then yes, try to find a way to use Spark Streaming (see the socket-ingestion sketch after this list). If you just want hourly / daily updates, then write to HDFS and stick a Spark job in cron.

  3. I recommend Apache Flume so that you can write your logs straight to HDFS (a sample agent config follows below): http://flume.apache.org/
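On question 2: the ingestion itself is only a few lines with the classic DStream API from the streaming guide you linked. Here's a minimal sketch in Scala, assuming your application writes log lines to a TCP socket - the host, port, and HDFS path are placeholders, and you'd normally launch this with spark-submit:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object LogIngest {
      def main(args: Array[String]): Unit = {
        // The master URL is supplied by spark-submit in a real deployment
        val conf = new SparkConf().setAppName("LogIngest")
        // 10-second batch interval; tune to your latency needs
        val ssc = new StreamingContext(conf, Seconds(10))

        // Ingest raw log lines from a plain old TCP socket
        val lines = ssc.socketTextStream("localhost", 9999)

        // Example analysis: count ERROR lines in each batch
        lines.filter(_.contains("ERROR")).count().print()

        // Also persist the raw lines to HDFS for later batch jobs
        lines.saveAsTextFiles("hdfs:///logs/app/stream")

        ssc.start()
        ssc.awaitTermination()
      }
    }

A bare socket receiver like this is fine for experimenting, but it can lose data if the receiver dies, which is one more reason Flume is the sturdier front door.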
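On question 3: a Flume agent is configured with a plain properties file wiring a source to a channel to a sink. A hypothetical single-agent setup (the names, port, and path are all placeholders) that accepts Avro events and rolls them into HDFS might look like:

    # One agent: Avro source -> memory channel -> HDFS sink
    agent.sources = appLogs
    agent.channels = mem
    agent.sinks = toHdfs

    agent.sources.appLogs.type = avro
    agent.sources.appLogs.bind = 0.0.0.0
    agent.sources.appLogs.port = 41414
    agent.sources.appLogs.channels = mem

    agent.channels.mem.type = memory
    agent.channels.mem.capacity = 10000

    agent.sinks.toHdfs.type = hdfs
    agent.sinks.toHdfs.hdfs.path = hdfs:///logs/app/%Y-%m-%d
    agent.sinks.toHdfs.hdfs.fileType = DataStream
    # Stamp events with the agent's clock so the %Y-%m-%d escape resolves
    agent.sinks.toHdfs.hdfs.useLocalTimeStamp = true
    agent.sinks.toHdfs.channel = mem

Flume ships a Log4j appender that talks to an Avro source like the one above; as far as I know there's no bundled logback appender, so the usual workaround is to log to files and point a Flume spooling-directory source at them.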

Yes, I'd say you're on the right track.

Upvotes: 2
