BorisG

Reputation: 1

PySpark Structured Streaming 2 SQLs Per Batch (Long addBatch Execution)

I have a PySpark Structured Streaming application (3.3.2) which needs to read input from Kafka in micro-batches and perform complex logic that includes joining data from a few DataFrames. The app is divided into 2 streaming queries:

  1. Load the DataFrames needed for the calculation - this is semi-static data (it changes a few times a day) which is cached for performance reasons, so it can be shared with streaming query 2.
  2. Perform the logic itself, using foreachBatch. Note: the query plan is quite big.

The problem is that for each micro-batch of streaming query 2, I see 2 SQLs running with the same batch_id and run_id - the first takes 20+ seconds and I don't understand what its purpose is; the second runs pretty fast and looks like the actual query execution.

Can someone explain why there are 2 SQLs running, and why the first one takes 20+ seconds?

My intuition was that since the plan is quite large, it takes time to handle the plan (though I expected the plan to be "compiled" once and not re-evaluated on each micro-batch).

I tried the following:

A few things to mention:

See screenshot

Upvotes: 0

Views: 129

Answers (1)

Neil Ramaswamy

Reputation: 166

What's likely going on here is that you are running a stateful query with a watermark; that is, you are using a watermark with deduplication, aggregation, a join, or (Flat)MapGroupsWithState.

If that is the type of query you have, here's why it's happening: the two batches occur because of the way that Structured Streaming (SS) implements stateful operators with respect to watermarks. After all the data in a batch is processed, Structured Streaming does two things, in this order:

  1. It uses the current watermark to determine what elements can be removed from state.
  2. It then updates the watermark, to be used in the next batch, by taking the largest event time seen in the current batch and subtracting the delay given in your .withWatermark call.

The issue is that Structured Streaming might have records in state that cannot be removed by the current watermark but can be removed by the updated watermark. As a result, it runs another "no data" batch, just to apply the watermark from step 2 to the records in state. This behavior is controlled by a config, `spark.sql.streaming.noDataMicroBatches.enabled` (on by default), which is nicely documented in the source code.
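If you want to confirm that the extra per-batch SQL is the no-data batch, you can toggle that config (available since Spark 2.4) and see whether the extra SQL disappears; this sketch assumes an existing SparkSession named `spark`:

```python
# Disable "no data" micro-batches (on by default). With this off, the extra
# per-batch SQL should disappear, but stateful results may be emitted later,
# since eviction then waits for the next data-carrying batch.
spark.conf.set("spark.sql.streaming.noDataMicroBatches.enabled", "false")
```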

An Example

Let's consider a 5-minute tumbling aggregation (i.e. windows are 0-5, 5-10, etc.) with a 10-minute watermark delay. Suppose we're just counting the number of elements in each window.

In our first batch, our watermark starts at 0; let's say we receive two records at time 12 and 13. After processing them, our internal state looks like:

  • [10, 15] -> 2 records

Our current watermark is 0, which is not greater than the window end of 15, so that window's result isn't emitted downstream. Then the watermark updates to 13 - 10 = 3. The no-data batch runs, but since 3 is still less than 15, nothing is emitted.

So the no-data batch didn't quite help there, but let's consider what happens when we receive a few more records: 14, 26, 28. After processing them, our state looks like:

  • [10, 15] -> 3 records
  • [25, 30] -> 2 records

Our current watermark is still 3, which is not greater than 15, so nothing is emitted. Then the watermark is updated to 28 - 10 = 18. Structured Streaming then runs its no-data batch. Since 18 is greater than the window end of 15, the [10, 15] -> 3 records entry is emitted downstream. Then, our state looks like:

  • [25, 30] -> 2 records

So, the no-data batch is helpful in that it emits results as soon as possible (within the watermark architecture of Structured Streaming). Note that this behavior applies to any stateful operator, not just aggregations.
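The walkthrough above can be replayed with a small pure-Python model of the state and watermark bookkeeping. This is illustrative only, not Spark code; the function and variable names are made up for this sketch:

```python
# Toy model of Structured Streaming's per-batch watermark mechanics.
# Window size 5 minutes, watermark delay 10 minutes; all times in minutes.
WINDOW, DELAY = 5, 10

def process_batch(state, watermark, events):
    """Run one data batch followed by its no-data batch.

    Returns (emitted_windows, new_state, new_watermark)."""
    # Ingest: bucket each event into its tumbling window, counting elements.
    for t in events:
        start = (t // WINDOW) * WINDOW
        key = (start, start + WINDOW)
        state[key] = state.get(key, 0) + 1

    # Step 1: the CURRENT watermark decides which windows can be emitted
    # and evicted now (a window is final once the watermark passes its end).
    emitted = {w: n for w, n in state.items() if w[1] < watermark}
    state = {w: n for w, n in state.items() if w[1] >= watermark}

    # Step 2: advance the watermark using this batch's max event time
    # minus the .withWatermark delay.
    if events:
        watermark = max(watermark, max(events) - DELAY)

    # The no-data batch applies the UPDATED watermark to the remaining state.
    emitted.update({w: n for w, n in state.items() if w[1] < watermark})
    state = {w: n for w, n in state.items() if w[1] >= watermark}
    return emitted, state, watermark

state, wm = {}, 0
out1, state, wm = process_batch(state, wm, [12, 13])      # nothing emitted, wm -> 3
out2, state, wm = process_batch(state, wm, [14, 26, 28])  # emits (10, 15) -> 3, wm -> 18
```

Running it reproduces the example: the first batch emits nothing even after its no-data batch, while the second batch's no-data batch emits the [10, 15] window with a count of 3, leaving only [25, 30] in state.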

Upvotes: 0
