Reputation: 11
I have Spark code executing on a Synapse Spark pool, along the lines of the sketch below. It reads data from a dedicated SQL pool table, performs map operations, and writes the data back into a dedicated SQL pool table.
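The code is along these lines (a simplified sketch of the actual job; I am using the Scala synapsesql connector, and the database, schema, table and column names here are just placeholders):

```scala
import org.apache.spark.sql.functions.{col, upper}
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._

// spark is the SparkSession provided by the Synapse notebook session

// Read the input table from the dedicated SQL pool
val inputDf = spark.read.synapsesql("sqlpool.dbo.input_table")

// Map-style transformations (placeholder for the real logic)
val outputDf = inputDf.withColumn("some_col", upper(col("some_col")))

// The single write action back to the dedicated SQL pool --
// this one call is what spawns the two jobs described below
outputDf.write.synapsesql("sqlpool.dbo.output_table", Constants.INTERNAL)
```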
Two jobs are spawned for just one write action
In Job 0, the data is written to ADLS
In Job 1, the same input data is re-read, re-processed and written to the target SQL pool table
Why are Job 1 (the second job) and the associated re-processing needed when Job 0 (the first job) has already prepared the output data?
Driver logs capture the execution of the following steps:
Step 1 - A staging directory for the processed data (Spark output) is created on Gen2
Step 2 - The input table is extracted to Gen2
Step 3 - Job 0 is executed (output created in the staging directory of Step 1)
Step 4 - Data is re-extracted from the input table (it was already available from Step 2)
Step 5 - The Job 0 output is loaded into the SQL pool table
Step 6 - Job 1 is executed (the same operations as Job 0 are re-executed)
Step 4 and Step 6 are essentially redundant
Upvotes: 0
Views: 1443
Reputation: 2998
Two jobs are generated in the backend for a single write operation because that is how the connector is designed internally.
Going into the details of what happens when you read and write data between a Synapse Spark pool and a dedicated SQL pool, each transfer is a two-step process:

For a read, the dedicated SQL pool first extracts the table to ADLS Gen2, and Spark then reads those files (your Step 2).
For a write, Spark first writes its output to a staging directory on Gen2, and the dedicated SQL pool then loads the staged files into the target table (your Steps 1, 3 and 5).
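You can see the same two-step shape in the Azure Databricks flavour of this connector (second link below), where the Gen2 staging location is an explicit option. A minimal sketch, with the server, database, storage account, container and table names as placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// A small DataFrame standing in for your transformed output
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// One write call, two steps behind the scenes:
// Step 1 - Spark stages the rows as files under tempDir on ADLS Gen2
// Step 2 - the dedicated SQL pool loads the staged files into dbo.output_table
df.write
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<sqlpool>")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", "dbo.output_table")
  .option("tempDir", "abfss://<container>@<account>.dfs.core.windows.net/tempdir")
  .mode("overwrite")
  .save()
```

The built-in synapsesql connector follows the same pattern for you, which is the staging directory you see being created in Step 1 of your driver log.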
I am attaching the picture from the official documentation that shows what the earlier approach was and what the new approach is.
Below are some official documentation references that will help you understand this in more detail.
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/synapse-analytics
Upvotes: 0