Reputation: 21
What would be the best way to batch RDD content into text files of approximately 100 MB, which would then be uploaded to S3? dstream.foreachRDD only seems to allow processing per RDD and does not let me accumulate RDDs until they reach a certain size.
Perhaps I'm missing something; Apache Spark Streaming concepts are still pretty new to me. I want to build a streaming application that takes data from Kafka, batches the messages into large files, and then uploads them online.
Related question: according to the documentation, dstream.foreachRDD runs func on the driver application. Does this mean only one node in the Spark cluster performs all the uploading? Wouldn't that leave me heavily network-I/O bound?
foreachRDD(func)
The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html
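To make the setup concrete, this is roughly what I have at the moment (uploadToS3 is a stand-in for my real upload code); each invocation only sees the RDD of one micro-batch, so the resulting files come out far smaller than 100 MB:

import org.apache.spark.streaming.dstream.DStream

def process(dstream: DStream[String]): Unit = {
  dstream.foreachRDD { rdd =>
    // each call only sees one micro-batch's RDD, so the files end up tiny
    uploadToS3(rdd.collect()) // collect() also pulls everything to the driver
  }
}

def uploadToS3(lines: Array[String]): Unit = {
  // placeholder: write `lines` to a temp file and PUT it to S3
}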
Upvotes: 0
Views: 65
Reputation: 13927
What about using RDD.union to collect each RDD into a base RDD? Something like this:
import org.apache.spark.rdd.RDD

var baseRdd: RDD[String] = sc.emptyRDD[String] // batches accumulated so far
var chunkSize = 0L                             // running size estimate of baseRdd
val threshold = 1000000L                       // flush once this size is exceeded

dstream.foreachRDD { newRdd =>
  // fold each micro-batch into the accumulator
  baseRdd = baseRdd.union(newRdd)
  chunkSize += calculateBatchSize(newRdd)

  if (chunkSize > threshold) {
    writeOutRdd(baseRdd)           // user-supplied: persist the accumulated RDD
    baseRdd = sc.emptyRDD[String]  // start a fresh accumulator
    chunkSize = 0L
  }
}
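calculateBatchSize and writeOutRdd are placeholders you'd supply yourself. A rough sketch of what they might look like (the size estimate and the s3a:// path are assumptions, not part of Spark's API):

import org.apache.spark.rdd.RDD

// Estimate a batch's size by summing string lengths; note this triggers a job
// on the new RDD every time it is called.
def calculateBatchSize(rdd: RDD[String]): Long =
  rdd.map(_.length.toLong).fold(0L)(_ + _)

// Write the accumulated RDD out as text files; with the hadoop-aws (s3a)
// connector on the classpath this can write directly to S3.
def writeOutRdd(rdd: RDD[String]): Unit =
  rdd.saveAsTextFile(s"s3a://my-bucket/batches/${System.currentTimeMillis()}")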
Upvotes: 0