Zak

Reputation: 85

Spark Streaming Processing Time vs Total Delay vs Processing Delay

I am trying to understand what the different metrics that Spark Streaming outputs mean, and I am slightly confused about the difference between the Processing Time, Total Delay, and Processing Delay of the last batch.

I have looked at the Spark Streaming guide, which mentions Processing Time as a key metric for figuring out whether the system is falling behind, but other places, such as "Pro Spark Streaming: The Zen of Real-Time Analytics Using Apache Spark", speak about using Total Delay and Processing Delay. I have failed to find any documentation that lists all the metrics produced by Spark Streaming and explains what each of them means.

I would appreciate it if someone could outline what each of these three metrics means, or point me to resources that can help me understand them.

Upvotes: 8

Views: 8259

Answers (3)

Th.

Reputation: 21

If your window is 1 minute and the average processing time is 1 minute 7 seconds, you have a problem: each batch will delay the next one by 7 seconds, as the toy sketch below illustrates.
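
To make that arithmetic concrete, here is a toy Scala sketch (plain arithmetic, not Spark API; the 60s/67s figures are the ones assumed above) showing how the scheduling delay grows without bound when each batch overruns its interval:

object DelayGrowth {
  // With a 60s batch interval and a 67s average processing time, every
  // batch pushes the next one 7 seconds later, and the backlog never drains.
  val batchIntervalSec = 60
  val processingTimeSec = 67

  def schedulingDelayAfter(nBatches: Int): Int =
    nBatches * (processingTimeSec - batchIntervalSec)

  def main(args: Array[String]): Unit = {
    println(schedulingDelayAfter(1))   // 7 seconds behind after 1 batch
    println(schedulingDelayAfter(100)) // 700 seconds behind after 100 batches
  }
}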

Your processing time graph shows a stable processing time, but one that is consistently higher than the batch interval.

I think that after a given amount of time, your driver will crash with "GC overhead limit exceeded", as it will be full of pending batches waiting to be executed.

You can fix this by reducing the processing time so that it stays under the micro-batch interval (which requires code and/or resource-allocation changes), by increasing the micro-batch interval (see the sketch below), or by moving to continuous streaming.
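
For the second option, a minimal sketch of widening the batch interval when creating the StreamingContext (the app name and the 90-second value are illustrative, not prescriptive):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Widen the micro-batch interval from 60s to 90s so that an average
// ~67s processing time fits comfortably inside each batch; the exact
// value should be tuned against your observed processing time.
val conf = new SparkConf().setAppName("StreamingApp")
val ssc = new StreamingContext(conf, Seconds(90))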

Regards

Upvotes: 0

Tomas Bartalos

Reputation: 1276

We're experiencing a stable processing time but an increasing scheduling delay.

Based on the answer below, the scheduling delay should be influenced only by the processing time of previous runs.

Spark is running only the streaming job, nothing else.

The time window is 1 minute, and each batch processes 120K records.

[Screenshot: Spark UI graphs showing a stable processing time alongside an increasing scheduling delay]

Upvotes: 0

Yuval Itzchakov

Reputation: 149538

Let's break down each metric. For that, let's define a basic streaming application which reads a batch every 4 seconds from some arbitrary source and computes the classic word count:

// Split each line into words, count occurrences within the batch,
// and write each batch's counts out as text files.
inputDStream.flatMap(line => line.split(" "))
            .map(word => (word, 1))
            .reduceByKey(_ + _)
            .saveAsTextFiles("hdfs://...")
  • Processing Time: The time it takes to compute a given batch for all of its jobs, end to end. In our case this means a single job which starts at flatMap and ends at saveAsTextFiles, and it assumes as a prerequisite that the job has already been submitted.

  • Scheduling Delay: The time taken by the Spark Streaming scheduler to submit the jobs of the batch. How is this computed? As we've said, our batch reads from the source every 4 seconds. Now let's assume that a given batch took 8 seconds to compute. This means that we're now 8 - 4 = 4 seconds behind, making the scheduling delay 4 seconds long.

  • Total Delay: This is Scheduling Delay + Processing Time. Following the same example, if we're 4 seconds behind, meaning our scheduling delay is 4 seconds, and the next batch took another 8 seconds to compute, this means that the total delay is now 8 + 4 = 12 seconds long.
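
If you want these numbers programmatically rather than from the UI, Spark exposes them per batch through the StreamingListener API; here is a minimal sketch (the DelayLogger class name is mine). Note that BatchInfo reports the metric the UI labels Processing Time under the name processingDelay, which may well be the source of the naming confusion in the question:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Logs the three metrics (all in milliseconds) for every completed batch.
// Register it with ssc.addStreamingListener(new DelayLogger).
class DelayLogger extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"batch ${info.batchTime}: " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)} ms, " +
      s"processingDelay=${info.processingDelay.getOrElse(-1L)} ms, " +
      s"totalDelay=${info.totalDelay.getOrElse(-1L)} ms")
  }
}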

A live example from a working Streaming application:

[Screenshot: the Streaming tab of the Spark UI, listing recent batches with their scheduling delay, processing time, and total delay]

We see that:

  • The bottom job took 11 seconds to process, so the next batch's scheduling delay is 11 - 4 = 7 seconds.
  • If we look at the second row from the bottom, we see that scheduling delay + processing time = total delay; in that case (rounding 0.9 up to 1), 7 + 1 = 8.

Upvotes: 19
