Do some Spark or spark.sql operations perform a collect during intermediate processing?

Question

I have run into out-of-memory issues in Spark, and most solutions tell me to reduce collect() operations or check for broadcast tables.

So I have a simple question: why does this happen when I am not using collect() or broadcast tables in my code?

Does Spark perform a collect during intermediate processing for some operations?

Yoan B. M.Sc · Accepted Answer

It does. I don't have an exhaustive list, but calling .toPandas() on a Spark DataFrame, for instance, collects all of the data onto the driver, even though you never call collect() directly.
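To make the .toPandas() point concrete, here is a minimal sketch of one way to bound how much data reaches the driver. The helper name `to_pandas_capped`, the `max_rows` parameter, and the `demo()` function are my own illustrations, not part of the Spark API; the only Spark calls used are `DataFrame.limit`, `DataFrame.toPandas`, `SparkSession.builder` and `spark.range`. It assumes a local PySpark install with pandas available.

```python
def to_pandas_capped(df, max_rows=1000):
    """Return at most max_rows rows of a Spark DataFrame as a pandas DataFrame.

    toPandas() materializes its whole input in driver memory, so limiting
    the row count first puts an upper bound on what the driver must hold.
    (Hypothetical helper, for illustration only.)
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    return df.limit(max_rows).toPandas()


def demo():
    # PySpark is imported here so the helper above can be used without it.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[1]")
        .appName("driver-collect-demo")
        .getOrCreate()
    )

    # A distributed DataFrame; defining it does not move data to the driver.
    df = spark.range(1_000_000)

    # Only the capped subset is brought to the driver.
    small = to_pandas_capped(df, max_rows=10)
    print(len(small))

    # By contrast, df.toPandas() with no limit would pull every row
    # into driver memory, just as df.collect() would.
    spark.stop()
```

Calling `demo()` runs the example end to end. If you genuinely need the full result on the driver, capping it is not an option, and increasing driver memory or writing the output to storage instead is probably the safer route.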
