Learn Hadoop

Reputation: 3060

PySpark Databricks optimization techniques

Below is my code snippet.

spark.read.table('schema.table_1').createOrReplaceTempView('d1')  # 400 million records
spark.read.table('schema.table_2').createOrReplaceTempView('d2')  # 300 million records

stmt = "select * from d1 inner join d2 on d1.id = d2.id"

(
    spark.sql(stmt)
        .write
        .format('delta')
        .mode('overwrite')
        .saveAsTable('schema.table_3')  # result count: 800 million records
)

Cluster size: 32 GB memory, 4 cores, and 6 workers.

DAG (screenshot)

From the DAG screenshot:

  1. Stage 219 is taking 1 hour.
  2. Stages 216 and 217 are skipped.

My questions are:

  1. Does stage 219 refer only to the write operation, or is it executing the SQL statement and then writing the result into the target table?
  2. How can I identify whether the join operation or the write to the target table is taking more time?
  3. Based on the DAG, stage 218 is taking 40 minutes.

Upvotes: 0

Views: 55

Answers (1)

Vamsi Bitra

Reputation: 2764

First, check the shuffle size for stage 218 in the Spark UI and look for skew in the join key distribution; a large shuffle or a skewed key usually means the join itself is what is taking the time.
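
For example, one quick way to check for skew (a minimal sketch, not from the original answer; it assumes the join key is the id column from the question) is to count rows per key on the larger side:

d1 = spark.read.table('schema.table_1')

(
    d1.groupBy('id')
      .count()
      .orderBy('count', ascending=False)
      .limit(20)
      .show()   # if a few ids dominate the counts, the join shuffle is skewed
)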

Stage 219 spends most of its time in write tasks (low shuffle read/write but high I/O), which indicates the write operation itself is the slow part.

Best optimization techniques:

  1. Use broadcast joins to avoid the shuffle when one side is small enough to fit in memory; otherwise repartition so the data is distributed evenly (see the sketch after this list).
  2. To optimize the write, use an efficient file format such as Delta/Parquet and increase the number of output partitions.
  3. Parquet is a compressed columnar format, which reduces storage and the amount of data scanned by queries.
  4. Given the data sizes above, partition the target table; it helps speed up downstream queries.
  5. Run OPTIMIZE on the Delta table to compact small files, and select only the columns you need, to reduce scan overhead.
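
As a rough illustration of points 1 and 2 (a sketch only; the partition count of 400 and repartitioning by id are assumptions, not tuned values), the join and write from the question could look like this:

from pyspark.sql import functions as F   # F.broadcast used only in the hint alternative below

d1 = spark.read.table('schema.table_1')
d2 = spark.read.table('schema.table_2')

# Both sides are too large to broadcast (300-400 million rows each), so
# repartition both on the join key to spread the shuffle evenly across tasks.
joined = (
    d1.repartition(400, 'id')
      .join(d2.repartition(400, 'id'), on='id', how='inner')
)

# If one side were a small lookup table, a broadcast hint would avoid the shuffle:
# joined = d1.join(F.broadcast(small_lookup_df), on='id', how='inner')

(
    joined.write
          .format('delta')
          .mode('overwrite')
          .saveAsTable('schema.table_3')
)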

Upvotes: 0
