Reputation: 42050
Suppose I am joining a few Spark
data frames like that:
abcd = a.join(b, 'bid', 'inner')\
.join(c, 'cid', 'inner')\
.join(d, 'did', 'left')\
.distinct()
abcd.head() # takes 5-7 min.
The head
invocation triggers the pipeline execution that takes 5-7 min. Does it have anything to do with those joins
? How would you make the pipeline faster ?
Upvotes: 1
Views: 56
Reputation: 6385
head()
returns just one record.
You don't need distinct()
, if you need just first record.
It might save you from expensive shuffle.
However, considering you have joins above, and resulted dataset is not sorted - there are no guarantees what record will be returned.
Upvotes: 1