Reputation: 11
I have Spark code executing on a Synapse Spark pool, along the lines of the sketch below. It reads data from a dedicated SQL pool table, performs map operations, and writes the data back into a dedicated SQL pool table.
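The code is along these lines (a simplified sketch of the actual job; I am using the Scala synapsesql connector, and the database, schema, table and column names here are just placeholders):

```scala
import org.apache.spark.sql.functions.{col, upper}
import com.microsoft.spark.sqlanalytics.utils.Constants
import org.apache.spark.sql.SqlAnalyticsConnector._

// spark is the SparkSession provided by the Synapse notebook session

// Read the input table from the dedicated SQL pool
val inputDf = spark.read.synapsesql("sqlpool.dbo.input_table")

// Map-style transformations (placeholder for the real logic)
val outputDf = inputDf.withColumn("some_col", upper(col("some_col")))

// The single write action back to the dedicated SQL pool --
// this one call is what spawns the two jobs described below
outputDf.write.synapsesql("sqlpool.dbo.output_table", Constants.INTERNAL)
```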
Two jobs are spawned for just one write action
In Job 0, the data is written to ADLS
In Job 1, the same input data is re-read, re-processed and written to the target SQL pool table
Why are Job 1 (the second job) and the associated re-processing needed when Job 0 (the first job) has already prepared the output data?
Driver logs capture the execution of the following steps:
Step 1 - A staging directory for the processed data (Spark output) is created on Gen2
Step 2 - The input table is extracted to Gen2
Step 3 - Job 0 is executed (output created in the staging directory of Step 1)
Step 4 - Data is re-extracted from the input table (it was already available from Step 2)
Step 5 - The Job 0 output is loaded into the SQL pool table
Step 6 - Job 1 is executed (the same operations as Job 0 are re-executed)
Step 4 and Step 6 are essentially redundant
Upvotes: 0
Views: 1443
Reputation: 2998
Two jobs are generated in the backend for a single write operation because that is how the connector is designed internally.
Going into the details of what happens when you read and write data between a Synapse Spark pool and a dedicated SQL pool, each transfer is a two-step process:

For a read, the dedicated SQL pool first extracts the table to ADLS Gen2, and Spark then reads those files (your Step 2).
For a write, Spark first writes its output to a staging directory on Gen2, and the dedicated SQL pool then loads the staged files into the target table (your Steps 1, 3 and 5).
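You can see the same two-step shape in the Azure Databricks flavour of this connector (second link below), where the Gen2 staging location is an explicit option. A minimal sketch, with the server, database, storage account, container and table names as placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// A small DataFrame standing in for your transformed output
val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// One write call, two steps behind the scenes:
// Step 1 - Spark stages the rows as files under tempDir on ADLS Gen2
// Step 2 - the dedicated SQL pool loads the staged files into dbo.output_table
df.write
  .format("com.databricks.spark.sqldw")
  .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<sqlpool>")
  .option("forwardSparkAzureStorageCredentials", "true")
  .option("dbTable", "dbo.output_table")
  .option("tempDir", "abfss://<container>@<account>.dfs.core.windows.net/tempdir")
  .mode("overwrite")
  .save()
```

The built-in synapsesql connector follows the same pattern for you, which is the staging directory you see being created in Step 1 of your driver log.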
I am attaching the picture from the official documentation that shows what the earlier approach was and what the new approach is.
Below are some official documentation references that will help you understand this in more detail.
https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/synapse-spark-sql-pool-import-export
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/synapse-analytics
Upvotes: 0