erikejan

Reputation: 27

Calculate rate of data processing from a Spark (Structured) Streaming Application

TL;DR What is the best practice for finding the maximum rate of incoming data that an Apache Spark data pipeline can handle?

I have written two Apache Spark pipelines for streaming data (one using Structured Streaming and one using the older DStream-based Spark Streaming). The pipelines receive streaming data over a socket connection. For local testing purposes I pipe a file to an ncat server in two ways:

  1. I pipe the file line-by-line with a slight delay between lines.
  2. I pipe the entire file of approximately 5000 data points at once.
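For reference, a minimal sketch of this kind of socket-based Structured Streaming pipeline (the host, port, and word-count transformation are illustrative placeholders, not the actual code):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SocketPipeline")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Read UTF-8 lines from the local ncat server as an unbounded DataFrame.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost") // assumed host
      .option("port", 9999)        // assumed port
      .load()

    // Placeholder transformation standing in for the real pipeline logic.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()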

These two feeding approaches (one fast, one slower) produce very different results for both the Streaming and the Structured Streaming pipeline. With the delayed line-by-line stream (1), both pipelines process all of the data; with the full dump (2), only a fraction of the data points are handled and a substantial portion is lost entirely.

This suggests that both pipelines have trouble keeping up with the rate of the full file dump (2) and that the pipelines' results depend on the rate of the incoming data. I would obviously like to run as close to this maximum rate as possible without exceeding it.

My question is: how do I find the maximum data processing rate of an Apache Spark Structured Streaming / Spark Streaming pipeline setup?
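One concrete way to measure the processing rate of the Structured Streaming pipeline is to tap the per-trigger metrics the engine reports. A minimal sketch using a StreamingQueryListener (Spark 2.1+ assumed; spark is the active SparkSession):

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener.{
      QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        val p = event.progress
        // inputRowsPerSecond: arrival rate at the source for this trigger.
        // processedRowsPerSecond: rate the query actually sustained.
        // If input consistently exceeds processed, the pipeline is falling behind.
        println(s"input=${p.inputRowsPerSecond}/s processed=${p.processedRowsPerSecond}/s")
      }
    })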

Upvotes: 3

Views: 970

Answers (1)

wandermonk

Reputation: 7346

From your question, I understand that you want to find the rate at which your Spark Streaming job is processing data. Spark has a component called PIDRateEstimator, which acts as a feedback loop for your application when backpressure is enabled. Setting backpressure is most meaningful for older Spark Streaming versions, where receivers are needed to consume the messages from a stream. Since Spark 1.3 there is a receiver-less "direct" approach that ensures stronger end-to-end guarantees, so you do not need to worry about backpressure as much; Spark does most of the fine-tuning. You can read more about the PIDRateEstimator at the links below:

https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/scheduler/rate/PIDRateEstimator.scala

https://vanwilgenburg.wordpress.com/2015/10/06/spark-streaming-backpressure/
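As a sketch, enabling the backpressure mechanism (and with it the PIDRateEstimator) on a receiver-based DStream pipeline comes down to one configuration flag; the host, port, and batch interval below are assumptions:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("BackpressureDemo")
      .setMaster("local[*]")
      // Let the PIDRateEstimator adapt ingestion to what recent batches handled.
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional: cap the first batch, before any feedback is available.
      .set("spark.streaming.backpressure.initialRate", "1000")

    val ssc = new StreamingContext(conf, Seconds(1)) // assumed 1s batch interval
    ssc.socketTextStream("localhost", 9999).count().print()
    ssc.start()
    ssc.awaitTermination()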

For rate limiting, you can use the Spark configuration property spark.streaming.kafka.maxRatePerPartition to set the maximum rate (in records per second) at which each Kafka partition will be read by the direct API.
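A sketch combining this hard cap with backpressure; the partition count and batch interval in the comment are made-up numbers to illustrate the arithmetic:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Adaptive rate control via the PIDRateEstimator.
      .set("spark.streaming.backpressure.enabled", "true")
      // Hard ceiling for a Kafka direct stream: records per second, per partition.
      .set("spark.streaming.kafka.maxRatePerPartition", "500")
    // With, say, 4 Kafka partitions and a 2-second batch interval, each batch
    // is capped at 500 records/s * 4 partitions * 2 s = 4000 records.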

Upvotes: 2
