tchelidze

Reputation: 8318

Azure Synapse Apache Spark : Pipeline level spark configuration

I am trying to configure Spark for an entire Azure Synapse pipeline. I found Spark session config magic command and How to set Spark / Pyspark custom configs in Synapse Workspace spark pool. The %%configure magic command works fine for a single notebook. Example:

Insert a cell with the content below at the beginning of the notebook:

%%configure -f
{
    "driverMemory": "28g",
    "driverCores": 4,
    "executorMemory": "32g",
    "executorCores": 4,
    "numExecutors" : 5
}

Then the following cell prints the expected values:

spark_executor_instances = spark.conf.get("spark.executor.instances")
print(f"spark.executor.instances {spark_executor_instances}")

spark_executor_memory = spark.conf.get("spark.executor.memory")
print(f"spark.executor.memory {spark_executor_memory}")

spark_driver_memory = spark.conf.get("spark.driver.memory")
print(f"spark.driver.memory {spark_driver_memory}")

However, if I add that notebook as the first activity in an Azure Synapse pipeline, the Apache Spark application that executes that notebook has the correct configuration, but the rest of the notebooks in the pipeline fall back to the default configuration.

How can I configure Spark for the entire pipeline? Should I copy-paste the %%configure cell above into each and every notebook in the pipeline, or is there a better way?

Upvotes: 4

Views: 2785

Answers (2)

David Beavon

Reputation: 1205

I want to follow up on the comments saying that Synapse will only allow you to reserve vcores in multiples of 4.

This was a bug in the past. There was some sort of "rounding" behavior where the calculations of vcores used by executors and drivers only worked properly for multiples of 4. But on 9/27/2023 I received the following update from CSS.

"We have an update from the PG team that the microservice responsible for rounding off the cores to the nearest available size has been modified to accommodate smaller container sizes, we have also deployed the new bits and currently the release has reached the east us region, so it will be complete shortly and you will be able to see the improvements."

To make a long story short, it is possible that Livy will start behaving better when submitting jobs for arbitrarily-sized executors. It is also possible that this will cause the Spark pool to auto-size up to the max number of nodes (via YARN).
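The old rounding behavior can be illustrated with a small sketch. This is not Synapse code; `reserved_vcores` is a hypothetical helper that just shows the arithmetic of rounding a requested core count up to the nearest container multiple, as described above:

```python
import math

def reserved_vcores(requested: int, container_multiple: int = 4) -> int:
    # Hypothetical illustration only: round the requested core count
    # up to the nearest multiple of the container size, mimicking the
    # "rounding" behavior described for executor/driver vcores.
    return math.ceil(requested / container_multiple) * container_multiple

print(reserved_vcores(2))  # → 4 (a 2-core request still reserves 4 vcores)
print(reserved_vcores(4))  # → 4 (multiples of 4 are unaffected)
print(reserved_vcores(5))  # → 8 (rounded up to the next container)
```

Under the fix described in the CSS update, smaller container sizes would be accommodated, so a 2-core request would no longer be forced up to 4.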

Upvotes: 0

Utkarsh Pal

Reputation: 4552

Yes, this is the well-known option AFAIK. You need to define %%configure -f at the beginning of each notebook in order to override the default settings for your job.

Alternatively, you can navigate to the Spark pool in the Azure portal and set the configurations on the Spark pool by uploading a text file of Spark properties.
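The original screenshots are not reproduced here, but a sketch of what such a properties file might contain, mirroring the %%configure values from the question (the property names are standard Spark settings; the values are illustrative):

```properties
spark.driver.memory 28g
spark.driver.cores 4
spark.executor.memory 32g
spark.executor.cores 4
spark.executor.instances 5
```

Settings uploaded this way apply to every session started on that pool, so individual notebooks no longer need their own %%configure cell unless they want to override the pool defaults.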

Please refer to this third-party article for more details.

Moreover, it looks like one cannot specify fewer than 4 cores for the executor or the driver. If you do, you get 1 core, but 4 cores are nevertheless reserved.

Upvotes: 1
