Andy_101

Reputation: 1306

Do some Spark or spark.sql operations perform a collect during intermediate processing?

I have run into some out-of-memory issues in Spark, and most solutions tell me to reduce the collect() operations or check broadcast tables.

So I have a simple question: why does this happen when I am not using collect() or broadcast tables in my code?

Does Spark perform a collect during intermediate processing for some operations?

Upvotes: 1

Views: 78

Answers (1)

Yoan B. M.Sc

Reputation: 1505

It does. I don't have an exhaustive list, but if you call .toPandas() on a Spark DataFrame, for instance, it will collect the data onto the driver, even though you're not calling collect() directly.
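To make that concrete, here is a minimal sketch, assuming PySpark and pandas are installed. The helper name `to_pandas_capped`, the `demo()` function, the row limits, and the app name are all made up for illustration. It caps the number of rows with limit() before calling toPandas(), so only a bounded amount of data reaches the driver. The `spark.sql.autoBroadcastJoinThreshold` line is my own addition, not something from the question: Spark can broadcast small tables in joins automatically, which also goes through the driver, and setting the threshold to -1 turns that off while you debug.

```python
def to_pandas_capped(df, max_rows=10_000):
    """Return at most max_rows rows of a Spark DataFrame as a pandas DataFrame.

    toPandas() materializes the whole result in driver memory, much like
    collect(), so the row count is capped with limit() first.
    (Hypothetical helper, for illustration only.)
    """
    return df.limit(max_rows).toPandas()


def demo():
    # Imported lazily so the helper above can be defined without pyspark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[2]")
        .appName("implicit-collect-demo")  # hypothetical app name
        .getOrCreate()
    )

    # Assumption: disable automatic broadcast joins while hunting for
    # driver-side memory pressure (-1 turns the feature off).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

    df = spark.range(1_000_000)  # one column named "id"

    # df.toPandas() here would pull all 1,000,000 rows onto the driver.
    pdf = to_pandas_capped(df, max_rows=1_000)
    print(len(pdf))

    spark.stop()
```

Calling `demo()` in an environment with PySpark runs the capped version. If the full table is needed on the driver anyway, raising the driver memory is the other lever, but that is a tuning decision rather than something this sketch covers.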

Upvotes: 1
