Pausing/throttling spark/spark-streaming application

Question

I wrote a simple spark-streaming application which basically reads stream of events from Kafka and stores these events to Cassandra in a table allowing efficient queries over these data. The main purpose of this job is to process current real-time data. But there are also historical events stored in hdfs.

I want to reuse the code processing RDDs (a part of the streaming job) in a historical job and I am wondering what is the best solution for reading historical data according to the following requirements:

Historical events are stored in daily rolled up files in hdfs (I want to run the job on a range of historical files)
It would be nice to have a possibility of pausing the job (inserts to cassandra are idempotent, so I need at-least-once processing)
I want to have some throttling mechanism allowing to define maximum number of events which can be processed (in a period of time: e.g every 1min)

I've considered two approaches so far:

Batch Spark job
- Ad1: Is there a better way to define RDD based on range of files than creating one RDD for each file and then union them?
- Ad2,3: Is it possible?
Spark Streaming job
- Ad1: How to efficiently define a range of input files? Sth better than using ssc.textFileStream(inputDir) and copying files which I want to process to this directory?
- Ad2: I assume that setting a checkpoint directory is what I am looking for.
- Ad3 I plan to use spark.streaming.receiver.maxRate property

Am I right that regular batch spark cannot meet my requirements? I am waiting for your advice regarding to spark streaming solution.

Pausing/throttling spark/spark-streaming application

Answers (1)

Related Questions