user3385082

Reputation: 21

Batching DStreams Out to External Systems

What would be the best way to batch RDD content into text files of approximately 100 MB, which would then be uploaded to S3? dstream.foreachRDD seems to only allow processing per RDD, and does not let me accumulate RDDs until they reach a certain size.

Perhaps I'm missing something. Apache Spark Streaming concepts are still fairly new to me. I want to build a streaming application that reads data from Kafka, batches the messages into large files, and then uploads them online.

Related question: according to the documentation, dstream.foreachRDD runs func on the driver. Does this mean only one node in the Spark cluster performs all the uploading? Wouldn't that leave me heavily network-I/O bound?

foreachRDD(func)

The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.

Source: http://spark.apache.org/docs/latest/streaming-programming-guide.html
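To make that concrete: is the intended pattern something like the sketch below, where the foreachRDD body runs on the driver but the per-partition closure is shipped to the executors? (uploadPartition is a hypothetical helper, just for illustration.)

def uploadPartition(records: Iterator[String]): Unit = ???  // hypothetical upload helper

dstream.foreachRDD { rdd =>
  // this block runs on the driver for each micro-batch...
  rdd.foreachPartition { records =>
    // ...but this closure runs on the executors, once per partition
    uploadPartition(records)
  }
}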

Upvotes: 0

Views: 65

Answers (1)

David Griffin

Reputation: 13927

What about using RDD.union to collect each RDD into a base RDD? Something like this:

import org.apache.spark.rdd.RDD

// Accumulate each batch's RDD into baseRdd; once the estimated size passes
// the threshold, write the accumulated data out and start over.
var baseRdd: RDD[String] = sc.emptyRDD
var chunkSize = 0
val threshold = 1000000  // size threshold, in whatever units calculateBatchSize returns

dstream.foreachRDD { newRdd =>
  baseRdd = baseRdd.union(newRdd)
  chunkSize = chunkSize + calculateBatchSize(newRdd)
  if (chunkSize > threshold) {
    writeOutRdd(baseRdd)
    baseRdd = sc.emptyRDD
    chunkSize = 0
  }
}
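calculateBatchSize and writeOutRdd above are placeholders. A minimal sketch of what they might look like, assuming the records are UTF-8 strings, the size is counted in bytes, and the output goes to an s3a:// path (the bucket name and layout are just examples; define these before the loop above):

// Rough byte count of a batch, assuming one record per line of UTF-8 text.
def calculateBatchSize(rdd: RDD[String]): Int =
  rdd.map(_.getBytes("UTF-8").length).fold(0)(_ + _)

// Write the accumulated batch as text files under a timestamped S3 prefix.
def writeOutRdd(rdd: RDD[String]): Unit =
  rdd.saveAsTextFile(s"s3a://my-bucket/batches/batch-${System.currentTimeMillis()}")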

Upvotes: 0
