Asif Khan

Reputation: 181

How can we remove Skewed Partitions in Spark?

I built a Spark SQL query that has 17-20 joins. My driving table is around 40 GiB, two or three of the joined tables hold 1-2 TB each, and another table has around 3-4 GiB of data.

I ran this job on 16 nodes of an 8xLarge cluster (32 cores, 128 GiB memory) with no success, and then again on 16 nodes of a 16xLarge cluster (64 cores, 256 GiB memory).

I dug into the issue and found that there are 2 skewed partitions holding a disproportionate amount of data. Can someone help me identify which part of my query produces the skew, and how to remove or split these skewed partitions?

Also, I have already tried all of the AQE (Adaptive Query Execution) options.
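For reference, the skew-related settings I toggled are along these lines (standard Spark 3.x property names; the values shown are just examples, not my exact tuning):

SET spark.sql.adaptive.enabled=true;
SET spark.sql.adaptive.skewJoin.enabled=true;
-- a partition counts as skewed when it is this many times larger than the median...
SET spark.sql.adaptive.skewJoin.skewedPartitionFactor=5;
-- ...and also exceeds this absolute size
SET spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB;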

Upvotes: 1

Views: 84

Answers (1)

Brad Hein

Reputation: 11057

Try adapting the query below to see the number of records per partition. Assume the table is partitioned by year, month, and day of month (y, m, d):

SELECT
  y, m, d,
  COUNT(*) AS records
FROM
  TABLENAME
GROUP BY
  y, m, d

Given a table with an oversized (skewed, in terms of relative size) partition, the query above will identify the problem partition. With this information you can decide how to rebalance the data: depending on the type of data, it may make sense to add another partition column, or to redistribute the rows from the large partition across other partitions where possible.
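If the oversized partitions trace back to a hot join key rather than to the physical partitioning, a standard workaround is key salting. Here is a minimal Spark SQL sketch, assuming hypothetical tables big_t (the large, skewed side) and dim_t (the smaller side) joined on join_key, with a salt factor of 10; all of these names and the factor are placeholders:

-- Scatter each row of the large side across 10 random sub-keys,
-- and replicate the small side 10 times so every sub-key still matches.
SELECT b.*, d.*  -- in practice, project only the columns you need
FROM (
  SELECT *, CAST(rand() * 10 AS INT) AS salt
  FROM big_t
) b
JOIN (
  SELECT d.*, s.salt
  FROM dim_t d
  LATERAL VIEW explode(sequence(0, 9)) s AS salt
) d
  ON b.join_key = d.join_key
 AND b.salt = d.salt

This trades a 10x blow-up of the replicated side for an even spread of the hot key, so it only pays off when that side is genuinely small.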

Notice that I queried and grouped only by partition fields. This keeps the query fast, since only a minimal set of metadata files needs to be read to gather these counts.
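Similarly, if the skew sits in a join key rather than in the partition layout, an analogous count over the join column will surface the hot keys (join_key is a placeholder here; unlike the query above, this one scans the data rather than just partition metadata):

SELECT
  join_key,
  COUNT(*) AS records
FROM
  TABLENAME
GROUP BY
  join_key
ORDER BY
  records DESC
LIMIT 20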

Upvotes: 0
