Asif Khan

Reputation: 181

How can we remove Skewed Partitions in Spark?

I built a Spark SQL query that has 17-20 joins. My driving table is around 40 GiB, two or three of the joined tables hold 1-2 TB each, and another table has around 3-4 GiB of data.

I ran this job on 16 nodes of an 8xLarge cluster (32 cores, 128 GiB memory) with no success, and then again on 16 nodes of a 16xLarge cluster (64 cores, 256 GiB memory).

I dug into the issue and found that there are 2 skewed partitions holding a disproportionate amount of data. Can someone help me identify which part of my query produces the skew, and how to remove or split these skewed partitions?

Also, I have already tried all of the AQE (Adaptive Query Execution) options.
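For reference, the skew-related settings I toggled are along these lines (standard Spark 3.x property names; the values shown are just examples, not my exact tuning):

SET spark.sql.adaptive.enabled=true;
SET spark.sql.adaptive.skewJoin.enabled=true;
-- a partition counts as skewed when it is this many times larger than the median...
SET spark.sql.adaptive.skewJoin.skewedPartitionFactor=5;
-- ...and also exceeds this absolute size
SET spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB;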

Upvotes: 1

Views: 84

Answers (1)

Brad Hein

Reputation: 11057

Try adapting the query below to see the number of records per partition. Assume the table is partitioned by year, month, and day of month (y, m, d):

SELECT
  y, m, d,
  COUNT(*) AS records
FROM
  TABLENAME
GROUP BY
  y, m, d

Given a table with an oversized (skewed, in terms of relative size) partition, the query above will identify the problem partition. With this information you can decide how to rebalance the data: depending on the type of data, it may make sense to add another partition column, or to redistribute the rows from the large partition across other partitions where possible.
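If the oversized partitions trace back to a hot join key rather than to the physical partitioning, a standard workaround is key salting. Here is a minimal Spark SQL sketch, assuming hypothetical tables big_t (the large, skewed side) and dim_t (the smaller side) joined on join_key, with a salt factor of 10; all of these names and the factor are placeholders:

-- Scatter each row of the large side across 10 random sub-keys,
-- and replicate the small side 10 times so every sub-key still matches.
SELECT b.*, d.*  -- in practice, project only the columns you need
FROM (
  SELECT *, CAST(rand() * 10 AS INT) AS salt
  FROM big_t
) b
JOIN (
  SELECT d.*, s.salt
  FROM dim_t d
  LATERAL VIEW explode(sequence(0, 9)) s AS salt
) d
  ON b.join_key = d.join_key
 AND b.salt = d.salt

This trades a 10x blow-up of the replicated side for an even spread of the hot key, so it only pays off when that side is genuinely small.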

Notice that I queried and grouped only by partition fields. This keeps the query fast, since only a minimal set of metadata files needs to be read to gather these counts.
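Similarly, if the skew sits in a join key rather than in the partition layout, an analogous count over the join column will surface the hot keys (join_key is a placeholder here; unlike the query above, this one scans the data rather than just partition metadata):

SELECT
  join_key,
  COUNT(*) AS records
FROM
  TABLENAME
GROUP BY
  join_key
ORDER BY
  records DESC
LIMIT 20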

Upvotes: 0
