Reputation: 3060
Below is my code snippet.
spark.read.table('schema.table_1').createOrReplaceTempView('d1') # 400 million records
spark.read.table('schema.table_2').createOrReplaceTempView('d2') # 300 million records
stmt = "select * from d1 inner join d2 on d1.id = d2.id"
(
spark.sql(stmt).write.format('delta').mode('overwrite').saveAsTable('schema.table_3') # result count: 800 million records
)
Cluster size: 32 GB memory, 4 cores, and 6 workers.
From the DAG picture (Spark UI screenshot of the job stages):
The question is: why is this join running so slowly, and how can it be optimized?
Upvotes: 0
Views: 55
Reputation: 2764
First, check the shuffle size in the Spark UI for stage 218, and also check for skew in the join key distribution: the high shuffle volume is why the join operation is taking more time.
Stage 219 spends more time in write tasks (low shuffle read/write but high I/O), which is why the write operation is slower.
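As a quick skew check (a sketch, assuming the d1 and d2 temp views from the question are already registered), you can count rows per join key and inspect the heaviest keys:

# Hypothetical skew check: if a few ids account for a large share of rows,
# the shuffle partitions handling those ids will dominate stage 218.
skew_stmt = "select id, count(*) as cnt from d1 group by id order by cnt desc limit 20"
spark.sql(skew_stmt).show(truncate=False)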
Best optimization techniques:
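For example (a minimal sketch, assuming Spark 3.x where adaptive query execution is available; the partition count below is illustrative, not tuned for this cluster), you could enable AQE with skew-join handling and rerun the write:

# Let Spark re-optimize the plan at runtime and split skewed shuffle partitions.
spark.conf.set('spark.sql.adaptive.enabled', 'true')
spark.conf.set('spark.sql.adaptive.skewJoin.enabled', 'true')
# Raise shuffle parallelism above the default 200 for ~800 million output rows (illustrative value).
spark.conf.set('spark.sql.shuffle.partitions', '400')

(
    spark.sql(stmt).write.format('delta').mode('overwrite').saveAsTable('schema.table_3')
)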
Upvotes: 0