Learn Hadoop

Reputation: 3060

PySpark Databricks optimization techniques

Below is my code snippet.

spark.read.table('schema.table_1').createOrReplaceTempView('d1')  # 400 million records
spark.read.table('schema.table_2').createOrReplaceTempView('d2')  # 300 million records

stmt = "select * from d1 inner join d2 on d1.id = d2.id"

(
    spark.sql(stmt)
        .write
        .format('delta')
        .mode('overwrite')
        .saveAsTable('schema.table_3')  # result count: 800 million records
)

Cluster size: 32 GB memory, 4 cores, and 6 workers.

DAG (screenshot)

From the DAG screenshot:

  1. Stage 219 is taking 1 hour.
  2. Stages 216 and 217 are skipped.

My questions are:

  1. Does stage 219 refer only to the write operation, or is it executing the SQL statement and then writing the result into the target table?
  2. How can I identify whether the join operation or the write to the target table is taking more time?
  3. Based on the DAG, stage 218 is taking 40 minutes.

Upvotes: 0

Views: 55

Answers (1)

Vamsi Bitra

Reputation: 2764

First, check the shuffle size for stage 218 in the Spark UI and look for skew in the join key distribution; a large shuffle or a skewed key usually means the join itself is what is taking the time.
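
For example, one quick way to check for skew (a minimal sketch, not from the original answer; it assumes the join key is the id column from the question) is to count rows per key on the larger side:

d1 = spark.read.table('schema.table_1')

(
    d1.groupBy('id')
      .count()
      .orderBy('count', ascending=False)
      .limit(20)
      .show()   # if a few ids dominate the counts, the join shuffle is skewed
)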

Stage 219 spends most of its time in write tasks (low shuffle read/write but high I/O), which indicates the write operation itself is the slow part.

Best optimization techniques:

  1. Use broadcast joins to avoid the shuffle when one side is small enough to fit in memory; otherwise repartition so the data is distributed evenly (see the sketch after this list).
  2. To optimize the write, use an efficient file format such as Delta/Parquet and increase the number of output partitions.
  3. Parquet is a compressed columnar format, which reduces storage and the amount of data scanned by queries.
  4. Given the data sizes above, partition the target table; it helps speed up downstream queries.
  5. Run OPTIMIZE on the Delta table to compact small files, and select only the columns you need, to reduce scan overhead.
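
As a rough illustration of points 1 and 2 (a sketch only; the partition count of 400 and repartitioning by id are assumptions, not tuned values), the join and write from the question could look like this:

from pyspark.sql import functions as F   # F.broadcast used only in the hint alternative below

d1 = spark.read.table('schema.table_1')
d2 = spark.read.table('schema.table_2')

# Both sides are too large to broadcast (300-400 million rows each), so
# repartition both on the join key to spread the shuffle evenly across tasks.
joined = (
    d1.repartition(400, 'id')
      .join(d2.repartition(400, 'id'), on='id', how='inner')
)

# If one side were a small lookup table, a broadcast hint would avoid the shuffle:
# joined = d1.join(F.broadcast(small_lookup_df), on='id', how='inner')

(
    joined.write
          .format('delta')
          .mode('overwrite')
          .saveAsTable('schema.table_3')
)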

Upvotes: 0
