Reputation: 169
I have an Azure Synapse workspace with a small Spark pool in it. I have written the code so that the same Spark notebook, connecting to the same Spark pool, is called multiple times based on a parameter that I pass from the Synapse pipeline.
The problem is that when two pipelines start at the same time, the notebook activity runs sequentially, leaving the second instance "queued" as shown below -
How can I make them run in parallel, so that the notebook runs triggered from different pipelines start at the same time? More information -
Notebook code -
import logging

import findspark
findspark.init()
findspark.find()

from pyspark.sql import SparkSession
from data_mesh_etl import table1, table2

# Build (or reuse) the Spark session, pulling in the SQL Server JDBC
# driver and the hadoop-azure connector
spark = SparkSession.builder \
    .appName("MyApp") \
    .config("spark.jars.packages", "com.microsoft.sqlserver:mssql-jdbc:9.4.1.jre11,org.apache.hadoop:hadoop-azure:3.3.1") \
    .getOrCreate()

spark.conf.set('spark.sql.caseSensitive', True)
spark.conf.set('spark.sql.debug.maxToStringFields', 3000)

logger = logging.getLogger()
logger.setLevel(logging.INFO)

# p_table_name is passed in from the calling Synapse pipeline
if p_table_name == 'table1':
    table1.load_table1_data_into_sql(spark, logger)
if p_table_name == 'table2':
    table2.load_table2_data_into_sql(spark, logger)
I pass the parameter p_table_name from pipeline_table1 with the value table1, and from pipeline_table2 with the value table2.
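For reference, p_table_name reaches the notebook through a parameters cell (marked via "Toggle parameter cell" in Synapse Studio), whose default the pipeline's Notebook activity base parameter overrides at run time. A minimal sketch of that cell, with an illustrative default value:

# Parameters cell - the base parameter p_table_name passed by the
# calling pipeline overrides this default value at run time
p_table_name = 'table1'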
When these two pipelines start at the same time, shouldn't my notebook also have two instances running in parallel? Is there a Spark concurrency setting that I am missing here?
Can someone please help with this?
TIA!
Sanket Kelkar
Upvotes: 1
Views: 1907
Reputation: 169
Answering my own question here - I got it working by simply increasing the size of the Spark pool. Each notebook or job run creates its own Spark application that reserves driver and executor nodes from the pool, so when the pool is too small to host a second application, that run waits in the queue; a bigger pool leaves room for both. See the attached screenshot, where four Spark job definitions were called at the same time and ran in parallel.
Note - here I tried Spark job definitions, but the same works on notebooks as well.
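If resizing the pool is not an option, another approach (untested in my setup, values illustrative) is to shrink what each session requests by putting the %%configure session magic at the top of the notebook, so that several sessions fit in the pool at once:

%%configure -f
{
    "driverMemory": "4g",
    "driverCores": 2,
    "executorMemory": "4g",
    "executorCores": 2,
    "numExecutors": 2
}

With smaller per-session footprints, the pool's nodes can host more than one concurrent Spark application instead of queueing the second one.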
Upvotes: 1