Michael

Reputation: 42050

How to improve this Spark pipeline?

Suppose I am joining a few Spark data frames like this:

abcd = a.join(b, 'bid', 'inner')\
        .join(c, 'cid', 'inner')\
        .join(d, 'did', 'left')\
        .distinct() 
abcd.head() # takes 5-7 min.

The head() call triggers the pipeline execution, which takes 5-7 min. Does it have anything to do with those joins? How would you make the pipeline faster?

Upvotes: 1

Views: 56

Answers (1)

vvg

Reputation: 6385

head() returns just one record. If you only need the first record, you don't need distinct(). Dropping it might save you an expensive shuffle.

However, since the result comes from joins and is not sorted, there is no guarantee which record will be returned.

Upvotes: 1
