Reputation: 771
I wrote a simple Spark Streaming application which basically reads a stream of events from Kafka and stores them in Cassandra, in a table that allows efficient queries over the data. The main purpose of this job is to process current, real-time data. But there are also historical events stored in HDFS.
I want to reuse the code that processes the RDDs (a part of the streaming job) in a historical job, and I am wondering what the best way is to read the historical data, given the following requirements:

1. choosing which historical files (HDFS directories) get read,
2. being able to pause the processing,
3. being able to limit the processing rate.
I've considered two approaches so far:
1. Using `ssc.textFileStream(inputDir)` and copying the files I want to process into this directory?
2. Limiting the rate with the `spark.streaming.receiver.maxRate` property?

Am I right that regular batch Spark cannot meet my requirements? I am looking forward to your advice regarding a Spark Streaming solution.
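To make the first idea concrete, this is roughly what I have in mind (the path, batch interval and rate value are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsReplayStreaming {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("hdfs-replay")
      // Note: this property only throttles receiver-based sources;
      // textFileStream does not use a receiver.
      .set("spark.streaming.receiver.maxRate", "1000")

    val ssc = new StreamingContext(conf, Seconds(10))

    // textFileStream only picks up files that appear in inputDir
    // after the context has started.
    val inputDir = "hdfs:///events/replay"
    ssc.textFileStream(inputDir)
      .count()
      .print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```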
Upvotes: 3
Views: 1946
Reputation: 1808
For a batch Spark job:

1. You can give comma-separated file names in the `sc.*File` operations (`sc.textFile`, `sc.sequenceFile`, ...).

2, 3. Since you will be able to control when the job runs and which files it reads, pausing and limiting the rate are effectively in your hands.
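For example, a rough sketch of such a batch job (the paths and the `processEvents` body are just placeholders for your existing per-RDD logic):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object HistoricalBatchJob {
  // The same transformation you already apply to each streaming micro-batch;
  // the body here is a placeholder.
  def processEvents(events: RDD[String]): RDD[String] =
    events.filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("historical-events"))

    // sc.textFile accepts a comma-separated list of paths (globs work too),
    // so one job can cover exactly the historical range you want.
    val historical = sc.textFile(
      "hdfs:///events/2015/01/01,hdfs:///events/2015/01/02")

    processEvents(historical).saveAsTextFile("hdfs:///events/processed")
    sc.stop()
  }
}
```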
For a streaming job:

1. You could define RDDs for the files and insert them using `queueStream`.

2. Depends on what you mean by pausing. You could simply stop the streaming context gracefully when you want to pause.

3. Yes, that is it.
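A minimal sketch of the `queueStream` idea (directory names, batch interval and the per-RDD logic are placeholders):

```scala
import scala.collection.mutable
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HistoricalStreamingJob {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("historical-replay"))
    val ssc = new StreamingContext(sc, Seconds(10))

    // One RDD per historical directory; each queued RDD becomes one micro-batch.
    val days  = Seq("hdfs:///events/2015/01/01", "hdfs:///events/2015/01/02")
    val queue = mutable.Queue[RDD[String]](days.map(d => sc.textFile(d)): _*)

    val historicalStream = ssc.queueStream(queue, oneAtATime = true)

    // Reuse whatever per-RDD logic the real-time job already has.
    historicalStream.foreachRDD { rdd =>
      rdd.filter(_.nonEmpty).take(10).foreach(println)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```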
But stepping back, you can do a lot of code sharing between the RDD and DStream transformations. Whatever you do for RDDs in your batch part can be reused within `DStream.transform()` in your streaming part.
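One possible shape for that sharing (names are illustrative):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object EventPipeline {
  // Shared core: plain RDD in, plain RDD out. Replace the body with your
  // real parsing/enrichment logic.
  def process(events: RDD[String]): RDD[String] =
    events.filter(_.nonEmpty).map(_.trim)

  // Batch/historical entry point.
  def runBatch(events: RDD[String]): RDD[String] =
    process(events)

  // Streaming entry point: transform applies the same function per micro-batch.
  def runStreaming(events: DStream[String]): DStream[String] =
    events.transform(process _)
}
```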
Upvotes: 2