Autonomous

Reputation: 9075

Joining a large and a massive Spark DataFrame

My problem is as follows:

What is the best possible way to achieve the join operation?

What have I tried?

After all this, the join still doesn't work. I am still learning PySpark, so I might have misunderstood the fundamentals behind repartitioning. If someone could shed light on this, it would be great.

P.S. I have already seen this question, but it does not answer mine.

Upvotes: 7

Views: 8872

Answers (1)

Avishek Bhattacharya

Reputation: 6994

The details table has 900k rows, with 75k distinct values in column A. I think filtering on column A, as you have tried, is the right direction. However, collecting the distinct values and then filtering with them,

attributes_filtered = attributes.filter(attributes.A.isin(*uniqueA)) 

is too expensive. An alternate approach would be:

from pyspark import StorageLevel

uniqueA = details.select('A').distinct().persist(StorageLevel.DISK_ONLY)
uniqueA.count()  # Force evaluation, breaking the DAG lineage
attrJoined = attributes.join(uniqueA, on="A", how="inner")

Also, you probably need to set the shuffle partition count correctly if you haven't done that yet.
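For example, assuming a SparkSession named spark, something like this (the value is only illustrative; choose it based on your data volume and cluster):

# Illustrative value only; pick a partition count suited to your data and cluster
spark.conf.set("spark.sql.shuffle.partitions", "400")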

One problem that could occur in your dataset is skew. It may be that, among the 75k unique values, only a few join with a very large number of rows in the attributes table. In that case the join could take much longer and may never finish.

To resolve that, you need to find the skewed values of column A and process them separately.
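One way to do that, sketched below under some assumptions (the 100000 threshold is arbitrary, and the column names match the tables above), is to split the hot keys out, join each slice on its own, and union the results:

from pyspark.sql.functions import broadcast

# Hypothetical threshold: treat keys with very many attribute rows as skewed
skewed_keys = (attributes.groupBy("A").count()
               .filter("count > 100000").select("A"))

hot = attributes.join(skewed_keys, on="A", how="inner")       # skewed keys only
rest = attributes.join(skewed_keys, on="A", how="left_anti")  # everything else

# The details rows for the few hot keys form a small set, so broadcasting them
# avoids shuffling the huge hot slice of the attributes table
hot_details = details.join(skewed_keys, on="A", how="inner")
joined_hot = hot.join(broadcast(hot_details), on="A", how="inner")
joined_rest = rest.join(details, on="A", how="inner")

result = joined_rest.union(joined_hot)  # both sides share the same column order

Whether broadcasting helps depends on how small the details rows for the hot keys really are; if they are still large, a salted join for those keys is another option.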

Upvotes: 7
