Reputation: 21
What would be the best way to batch RDD content into text files of approximately 100 MB, which would then be uploaded to S3? dstream.foreachRDD only seems to allow processing per RDD and does not let me accumulate RDDs until they reach a certain size.
Perhaps I'm missing something; Apache Spark Streaming concepts are still pretty new to me. I want to build a streaming application that takes data from Kafka, batches the messages into large files, and then uploads them online.
Related question: according to the documentation, dstream.foreachRDD runs func on the driver application. Does this mean only one node in the Spark cluster performs all the uploading? Wouldn't that leave me heavily network-I/O bound?
foreachRDD(func)
The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html
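To make the setup concrete, this is roughly what I have at the moment (uploadToS3 is a stand-in for my real upload code); each invocation only sees the RDD of one micro-batch, so the resulting files come out far smaller than 100 MB:

import org.apache.spark.streaming.dstream.DStream

def process(dstream: DStream[String]): Unit = {
  dstream.foreachRDD { rdd =>
    // each call only sees one micro-batch's RDD, so the files end up tiny
    uploadToS3(rdd.collect()) // collect() also pulls everything to the driver
  }
}

def uploadToS3(lines: Array[String]): Unit = {
  // placeholder: write `lines` to a temp file and PUT it to S3
}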
Upvotes: 0
Views: 65
Reputation: 13927
What about using RDD.union to collect each RDD into a base RDD? Something like this:
import org.apache.spark.rdd.RDD

var baseRdd: RDD[String] = sc.emptyRDD[String] // batches accumulated so far
var chunkSize = 0L                             // running size estimate of baseRdd
val threshold = 1000000L                       // flush once this size is exceeded

dstream.foreachRDD { newRdd =>
  // fold each micro-batch into the accumulator
  baseRdd = baseRdd.union(newRdd)
  chunkSize += calculateBatchSize(newRdd)

  if (chunkSize > threshold) {
    writeOutRdd(baseRdd)           // user-supplied: persist the accumulated RDD
    baseRdd = sc.emptyRDD[String]  // start a fresh accumulator
    chunkSize = 0L
  }
}
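calculateBatchSize and writeOutRdd are placeholders you'd supply yourself. A rough sketch of what they might look like (the size estimate and the s3a:// path are assumptions, not part of Spark's API):

import org.apache.spark.rdd.RDD

// Estimate a batch's size by summing string lengths; note this triggers a job
// on the new RDD every time it is called.
def calculateBatchSize(rdd: RDD[String]): Long =
  rdd.map(_.length.toLong).fold(0L)(_ + _)

// Write the accumulated RDD out as text files; with the hadoop-aws (s3a)
// connector on the classpath this can write directly to S3.
def writeOutRdd(rdd: RDD[String]): Unit =
  rdd.saveAsTextFile(s"s3a://my-bucket/batches/${System.currentTimeMillis()}")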
Upvotes: 0