Richard H

Reputation: 67

Databricks - automatic parallelism and Spark SQL

I have a general question about Databrick cells and auto-parallelism with Spark SQL. I have a summary table that has a number of fields of which most have a complex logic behind them.

If I put blocks (%SQL) of individual field logic in individual cells, will the scheduler automatically attempt to allocate the cells to different nodes on the cluster to improve performance (depending on how many nodes my cluster has)? Alternatively, are there PySpark functions I can use to organise the parallel running myself? I can't find much about this elsewhere...

I am using LTS 10.4 (Spark 3.2.1 Scala 2.12)

Many thanks Richard

Upvotes: 0

Views: 3888

Answers (2)

Kumar

Reputation: 1

You can try running the SQL statements with spark.sql() and assigning the outputs to different dataframes. In the last step, you can execute an operation (for example a join) that brings them all into one dataframe. Lazy evaluation should then evaluate all the dataframes (i.e. your SQL queries) in parallel.
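A minimal sketch of this idea, assuming hypothetical table and column names (source_table, complex_expr_a/b, summary_table), since the original post doesn't show the actual logic:

```python
# Run each field's SQL separately; Spark only builds logical plans here,
# nothing executes yet because transformations are lazy.
df_field_a = spark.sql("SELECT id, complex_expr_a AS field_a FROM source_table")
df_field_b = spark.sql("SELECT id, complex_expr_b AS field_b FROM source_table")

# Combine the per-field results into one dataframe (still no execution).
summary_df = df_field_a.join(df_field_b, on="id", how="inner")

# Execution is triggered only by an action, e.g. writing the summary table.
summary_df.write.mode("overwrite").saveAsTable("summary_table")
```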

Upvotes: 0

restlessmodem

Reputation: 448

If you write Python ("PySpark") code over multiple cells, there is something called "lazy evaluation": the actual work only happens at the last possible moment (for example when data is written or displayed). So before you run, say, a display(df), no actual work is done on the cluster. Technically, the code across multiple code cells can therefore be parallelized efficiently.

However, in Databricks a Spark SQL cell is executed to completion before the next one is started. If you want to run those concurrently, you can look at running multiple notebooks at the same time (or multiple parameterized instances of the same notebook) with dbutils.notebook.run(). The cluster will then automatically split its resources evenly between the queries running at the same time.
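A hedged sketch of that approach, launching the notebook runs from driver threads so they overlap. The child notebook path ("./field_logic") and its "field" parameter are hypothetical placeholders for the per-field logic:

```python
from concurrent.futures import ThreadPoolExecutor

fields = ["field_a", "field_b", "field_c"]

def run_field_notebook(field):
    # dbutils.notebook.run(path, timeout_seconds, arguments) runs a parameterized
    # instance of the child notebook on the same cluster and returns its exit value.
    return dbutils.notebook.run("./field_logic", 3600, {"field": field})

# Each thread triggers a separate notebook run, so the queries execute
# concurrently and share the cluster's resources.
with ThreadPoolExecutor(max_workers=len(fields)) as pool:
    results = list(pool.map(run_field_notebook, fields))
```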

Upvotes: 1
