Reputation: 143
I have a server that generates log files every second, and I want to process these files using Apache Spark.
I wrote a Spark application in Python that processes a group of log files inside a while loop.
I stop the SparkContext at the end of each iteration and start a new one for the next step.
My question is: what is the best approach for this kind of application, which runs indefinitely and processes batches (groups) of generated files? Should I use an infinite while loop, run my code as a cron job, or use a scheduling framework like Airflow?
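Here is a simplified sketch of what my loop looks like now (the log path and the processing step are placeholders):

```python
import time
from pyspark import SparkContext

while True:
    sc = SparkContext(appName="LogBatchProcessor")
    # Placeholder path: pick up whatever log files are currently present
    logs = sc.textFile("/var/log/myserver/*.log")
    print(logs.count())  # stand-in for the real processing
    sc.stop()            # stop the context after each group of files
    time.sleep(1)        # wait for the next group to be generated
```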
Upvotes: 1
Views: 983
Reputation: 578
The best way to solve this is to use Spark Streaming, which lets you process live data streams. Spark Streaming currently integrates with Kafka, Flume, HDFS, S3, Amazon Kinesis, and Twitter. You should first push these logs into Kafka and then write a Spark Streaming program that processes the live stream of logs. This is a much cleaner solution than using an infinite loop and starting and stopping the SparkContext multiple times.
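As a starting point, here is a minimal Spark Streaming sketch. It assumes a Kafka broker at localhost:9092 and a topic named server-logs (both placeholders you would replace with your own setup), and uses the createDirectStream API from pyspark.streaming.kafka:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="LogStreamProcessor")
ssc = StreamingContext(sc, batchDuration=5)  # process logs in 5-second micro-batches

# Connect directly to Kafka; each record arrives as a (key, value) pair.
stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["server-logs"],  # placeholder topic name
    kafkaParams={"metadata.broker.list": "localhost:9092"},
)

# Example processing: count log lines containing "ERROR" in each batch.
error_counts = stream.map(lambda kv: kv[1]) \
                     .filter(lambda line: "ERROR" in line) \
                     .count()
error_counts.pprint()

ssc.start()             # one long-lived context; no stop/start per batch
ssc.awaitTermination()  # run until explicitly stopped
```

Note that the StreamingContext is created once and runs until the application is terminated, so the repeated stop/start of the SparkContext disappears entirely.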
Upvotes: 3