Leyth G

Reputation: 1143

How to throttle Spark Streaming?

This question follows on from my other question about managing AmazonDynamoDBClient throttles and retries. However, I think the solution might exist before I even get to the Dynamo call.

My high-level process is as follows: I have a Scala application that uses Apache Spark to read large CSV files, perform some aggregations on them, and then write the results to DynamoDB. I deploy this to EMR for scalability. The issue is that once aggregation is complete, we have millions of records ready to go into DynamoDB, but the table has a write capacity limit. The records don't need to be inserted immediately, but it would be nice to control how many go in per second so we can fine-tune the rate for our use case.

Here is a code sample of what I have so far:

import org.apache.spark.sql.{ForeachWriter, Row}

val foreach = new ForeachWriter[Row] {
  override def open(partitionId: Long, version: Long): Boolean = {
    true
  }

  override def process(value: Row): Unit = {
    // write to dynamo here
  }

  override def close(errorOrNull: Throwable): Unit = {}
}

import org.apache.spark.sql.streaming.OutputMode

val query = dataGrouped
  .writeStream
  .queryName("DynamoOutput")
  .foreach(foreach) // the foreach sink takes the place of format("console")
  .outputMode(OutputMode.Complete())
  .start()

query.awaitTermination()
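For concreteness, here is the shape of throttling I'm imagining: a sketch that rate-limits inside the ForeachWriter using Guava's RateLimiter. The Guava dependency, the 100 records/sec figure, and the Dynamo write are placeholders, not a working solution:

import com.google.common.util.concurrent.RateLimiter
import org.apache.spark.sql.{ForeachWriter, Row}

val throttledForeach = new ForeachWriter[Row] {
  // One limiter per partition, created on the executor in open(),
  // so the effective cluster-wide rate is roughly
  // (records/sec per partition) * (number of partitions).
  private var limiter: RateLimiter = _

  override def open(partitionId: Long, version: Long): Boolean = {
    limiter = RateLimiter.create(100.0) // placeholder: 100 records/sec
    true
  }

  override def process(value: Row): Unit = {
    limiter.acquire() // blocks until a permit is available
    // write to dynamo here
  }

  override def close(errorOrNull: Throwable): Unit = {}
}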

Does anyone have any recommendations for how to solve this problem?

Upvotes: 3

Views: 2897

Answers (1)

Vidya

Reputation: 30320

You should look into the spark.streaming.backpressure.enabled configuration. From the documentation:

Setting the max receiving rate - If cluster resources are not large enough for the streaming application to process data as fast as it is being received, the receivers can be rate limited by setting a maximum rate limit in terms of records/sec. See the configuration parameters spark.streaming.receiver.maxRate for receivers and spark.streaming.kafka.maxRatePerPartition for the Direct Kafka approach. In Spark 1.5, we have introduced a feature called backpressure that eliminates the need to set this rate limit, as Spark Streaming automatically figures out the rate limits and dynamically adjusts them if the processing conditions change. This backpressure can be enabled by setting the configuration parameter spark.streaming.backpressure.enabled to true.
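For example, a minimal sketch of turning this on via the SparkConf for a DStream-based Spark Streaming job (the app name, batch interval, and numeric caps below are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("ThrottledStream") // placeholder name
  // Let Spark Streaming adapt the receiving rate to processing conditions.
  .set("spark.streaming.backpressure.enabled", "true")
  // Optional hard caps in records/sec; placeholder values.
  .set("spark.streaming.receiver.maxRate", "10000")
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val ssc = new StreamingContext(conf, Seconds(1)) // placeholder batch interval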

Upvotes: 4
