How to distribute dataset evenly to avoid a skewed join (and long-running tasks)?

Question

I am writing application using Spark dataset API on databricks notebook.

I have 2 tables. One is 1.5billion rows and second 2.5 million. Both tables contain telecommunication data and join is done using country code and first 5 digits of a number. Output has 55 billion rows. Problem is I have skewed data(long running tasks). No matter how i repartition dataset I get long running tasks because of uneven distribution of hashed keys.

I tried using broadcast joins, tried persisting big table partitions in memory etc.....

What are my options here?

How to distribute dataset evenly to avoid a skewed join (and long-running tasks)?

Answers (1)

Related Questions