Reputation: 1306
I have run into some out-of-memory issues in Spark, and most solutions tell me to reduce collect() operations or check broadcast tables.
So I have a simple question: why does this happen when I am not using collect or broadcast tables in my code?
Does Spark perform a collect during intermediate processing for some operations?
Upvotes: 1
Views: 78
Reputation: 1505
It does. I don't have an exhaustive list, but if you call .toPandas()
on a Spark DataFrame, for instance, it will collect the data on the driver,
even though you're not directly calling collect().
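To make that concrete, here is a minimal PySpark sketch (not part of the original answer). The helper name `to_pandas_bounded` and its `max_rows` parameter are hypothetical and only for illustration; `DataFrame.limit` and `DataFrame.toPandas` are the standard PySpark methods. The idea is to cap the row count before converting, so the implicit collect on the driver stays bounded:

```python
def to_pandas_bounded(df, max_rows=10_000):
    """Convert a Spark DataFrame to pandas, pulling at most max_rows rows.

    toPandas() gathers every row of the DataFrame into driver memory,
    much like collect(), so limit() is applied first to cap the size.
    (to_pandas_bounded and max_rows are hypothetical names.)
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    # limit() is a lazy transformation; toPandas() is the call that
    # triggers the job and brings the capped result to the driver.
    return df.limit(max_rows).toPandas()
```

With an existing DataFrame `df`, something like `pdf = to_pandas_bounded(df, 1000)` should give a pandas DataFrame of at most 1000 rows, whereas calling `df.toPandas()` directly brings the whole dataset to the driver.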
Upvotes: 1