Zubin Choksi

Reputation: 11

Load data from Synapse Spark pool to dedicated SQL pool - multiple jobs for a single write action

I have Spark code executing on a Synapse Spark pool, as below. It reads data from a dedicated SQL pool table, performs map operations, and writes the data back into a dedicated SQL pool table.

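The original snippet was not preserved in the post, but a minimal sketch of the read-map-write pattern described above, assuming the Synapse dedicated SQL pool connector's Scala API (`synapsesql`; the older Spark 2.4 connector exposes `sqlanalytics` instead) and placeholder table names, would look something like this:

```scala
// Assumes the built-in `spark` session of a Synapse notebook.
// Database/table names and the map logic are placeholders, not the post's actual code.
import org.apache.spark.sql.functions.current_timestamp
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._

// Read from a dedicated SQL pool table
val inputDf = spark.read.synapsesql("MyDatabase.dbo.InputTable")

// Map-style transformation (stand-in for the post's actual logic)
val outputDf = inputDf.withColumn("processed_at", current_timestamp())

// Write back into an internal (managed) dedicated SQL pool table
outputDf.write.synapsesql("MyDatabase.dbo.OutputTable", Constants.INTERNAL)
```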

Two jobs are spawned for just one write action

[screenshot: the two Spark jobs]

In Job 0, the data is written to ADLS.


In Job 1, the same input data is re-read, re-processed, and written to the target SQL pool table.

[screenshot: DAG of Job 1]

Why are Job 1 (the second job) and the associated re-processing needed when Job 0 (the first job) has already produced the output data?

The driver logs capture the execution of the following steps:

Step 1 - A staging directory for the processed data (the Spark output) is created on ADLS Gen2

Step 2 - The input table is extracted to ADLS Gen2

Step 3 - Job 0 executes (output is created in the staging directory from Step 1)

Step 4 - Data is re-extracted from the input table (it was already available from Step 2)

Step 5 - The Job 0 output is loaded into the SQL pool table

Step 6 - Job 1 executes (the same operations as Job 0 are re-run)

Steps 4 and 6 are essentially redundant.
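The re-processing is consistent with core Spark semantics: transformations are lazy, and each action replays the full lineage from the source unless the intermediate DataFrame is persisted. A minimal illustration with hypothetical names (plain Spark, not the connector itself):

```scala
import org.apache.spark.sql.functions.{col, upper}
import org.apache.spark.storage.StorageLevel

// Hypothetical example: two actions over the same uncached DataFrame.
val mapped = inputDf.withColumn("name_upper", upper(col("name"))) // lazy: nothing runs yet

mapped.write.parquet("abfss://staging@myaccount.dfs.core.windows.net/tmp/out") // action 1: source read, map runs
println(mapped.count()) // action 2: source is re-read and the map re-runs

// Persisting before the two actions would let the second one reuse the first's work:
// mapped.persist(StorageLevel.DISK_ONLY)
```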

Upvotes: 0

Views: 1443

Answers (1)

Nikunj Kakadiya

Reputation: 2998

Two jobs are generated in the backend for a single write operation because the connector is internally designed that way.

Going into the details of what happens when you read or write dedicated SQL pool data from a Synapse Spark pool: it is a two-step process (sketched after these steps):

  1. The data that you are reading or writing is first saved to Azure Storage (either Blob Storage or ADLS Gen2) as an intermediate state.
  2. Spark then reads the data saved in Azure Storage and performs the transformations on it. If you are writing data, PolyBase or the COPY statement is used to transfer the staged data into the SQL pool.
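As a rough conceptual sketch of the write path (not the connector's actual internals; the staging path and table name are hypothetical):

```scala
// Conceptual sketch only: the connector performs the equivalent of these steps internally.

// Step 1: the DataFrame is first staged to Azure Storage as files (one Spark job)
outputDf.write.parquet("abfss://staging@myaccount.dfs.core.windows.net/synapse/tmp/job-output")

// Step 2: the dedicated SQL pool then ingests the staged files server-side,
// conceptually equivalent to a T-SQL bulk load such as:
//   COPY INTO dbo.OutputTable
//   FROM 'https://myaccount.blob.core.windows.net/staging/synapse/tmp/job-output'
//   WITH (FILE_TYPE = 'PARQUET');
```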

I am attaching the picture from the official documentation that shows what the earlier approach was and what the new approach is. [image: earlier vs. new approach, from the official documentation]

Below are some official documentation references that explain this in more detail.

https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export

https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/synapse-analytics

Upvotes: 0
