Reputation: 907
I have an S3 json dataset that is a dump of a KMS client-side encrypted DynamoDB (i.e each record is KMS client-side encrypted independently).
I would like to use Spark to load that dataset and perform some analysis, which means I have to call KMS to decrypt each record. A UDF that simply decrypts each line works, but it hits the KMS API limit of 100 calls/sec.
I am wondering if there is some way to rate-limit these Spark map operations.
Upvotes: 2
Views: 1938
Reputation: 1483
I think this can be handled by a Spark Streaming application.
Check spark.streaming.backpressure.enabled and spark.streaming.receiver.maxRate.
spark.streaming.backpressure.enabled: Enables or disables Spark Streaming's internal backpressure mechanism (since 1.5). This enables Spark Streaming to control the receiving rate based on the current batch scheduling delays and processing times so that the system receives only as fast as it can process. Internally, this dynamically sets the maximum receiving rate of receivers. This rate is upper bounded by the values spark.streaming.receiver.maxRate and spark.streaming.kafka.maxRatePerPartition if they are set (see below).
When you want to cap the rate at 100 calls/sec, set spark.streaming.receiver.maxRate:

spark.streaming.receiver.maxRate: Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no limit on the rate. See the deployment guide in the Spark Streaming programming guide for more details.
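For example, assuming the decryption job is rewritten as a Spark Streaming application (your_streaming_app.py is a placeholder name), both settings could be passed at submit time, a minimal sketch:

```shell
# Cap each receiver at 100 records/sec (matching the KMS quota) and let
# backpressure lower the rate further if batches start falling behind.
spark-submit \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.receiver.maxRate=100 \
  your_streaming_app.py
```

Note that maxRate applies per receiver, so if the application runs more than one receiver, the combined rate can still exceed 100 records/sec and each receiver's limit would need to be lowered accordingly.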
Upvotes: 1