How to make sure my DataFrame frees its memory?

I have a Spark/Scala job in which I do this:

1: Compute a big DataFrame df1 and cache it in memory
2: Use df1 to compute dfA
3: Read raw data into df2 (again, it's big) and cache it

By the time I perform (3), I no longer need df1, and I want to make sure its space gets freed. I cached it at (1) because it gets used in (2), and caching is the only way to make sure it is computed once rather than recomputed each time.

I need to free its space and make sure it gets freed. What are my options?

I thought of these, but they don't seem to be sufficient:

df=null
df.unpersist()

Can you document your answer with a proper Spark documentation link?

Upvotes: 13

Answers (3)

Ayomal Praveen

Reputation: 19

Calling df.unpersist(blocking = true) will solve the issue.

For further explanation -> https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/

Upvotes: 0

puhlen

Reputation: 8529

df.unpersist should be sufficient, but it won't necessarily free the memory right away. It merely marks the DataFrame for removal.

You can use df.unpersist(blocking = true) which will block until the dataframe is removed before continuing on.
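As a rough illustration of the three steps from the question, here is a minimal sketch. It assumes a local SparkSession and uses a synthetic `spark.range` DataFrame as a stand-in for df1; the object name `UnpersistSketch` and the `bucket` column are made up for the example. It reads `storageLevel` before and after the blocking unpersist so you can check that Spark no longer considers df1 cached.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  // Returns (storage level while cached, storage level after unpersist).
  def run(spark: SparkSession): (StorageLevel, StorageLevel) = {
    // Step 1: stand-in for the expensive df1 (assumption: synthetic data).
    val df1 = spark.range(0L, 1000000L).toDF("id").cache()
    df1.count() // an action is needed to actually materialize the cache

    // Step 2: derive dfA from the cached df1 and materialize it.
    val dfA = df1.groupBy((df1("id") % 10).as("bucket")).count()
    dfA.collect()

    val whileCached = df1.storageLevel

    // Before step 3: drop df1's cached blocks and wait until they are removed.
    df1.unpersist(blocking = true)
    val afterUnpersist = df1.storageLevel

    (whileCached, afterUnpersist)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("unpersist-sketch")
      .getOrCreate()
    try {
      val (before, after) = run(spark)
      println(s"while cached, useMemory = ${before.useMemory}")
      println(s"after unpersist, level is NONE = ${after == StorageLevel.NONE}")
    } finally {
      spark.stop()
    }
  }
}
```

Note that dfA has to be materialized (or cached itself) before df1 is unpersisted; otherwise a later action on dfA would recompute df1 from scratch.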

Upvotes: 18

Nazarii Bardiuk

Reputation: 4342

A Spark user has no way to manually trigger garbage collection.

Assigning df = null is not going to release much memory, because a DataFrame does not hold data; it is just a description of a computation.
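One way to see this for yourself is to count the RDDs that the SparkContext still tracks as persisted. The sketch below is only an illustration under assumptions: the object name `CacheInspectionSketch` and the synthetic data are made up, and it uses `spark.catalog.clearCache()`, which drops everything cached in the session, not just one DataFrame. As far as I understand, the session's cache keeps its own reference to the cached data, so nulling your variable should not change the count, but check the numbers on your own Spark version.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CacheInspectionSketch {
  // Number of RDDs the SparkContext currently tracks as persisted.
  def persistedCount(spark: SparkSession): Int =
    spark.sparkContext.getPersistentRDDs.size

  // Returns the count after caching, after nulling the reference,
  // and after clearing the session cache.
  def run(spark: SparkSession): (Int, Int, Int) = {
    var df: DataFrame = spark.range(0L, 100000L).toDF("id").cache()
    df.count() // materialize the cache
    val afterCache = persistedCount(spark)

    // Dropping our reference only discards the description of the computation.
    df = null
    val afterNull = persistedCount(spark)

    // Explicitly remove all cached data held by this session.
    spark.catalog.clearCache()
    val afterClear = persistedCount(spark)

    (afterCache, afterNull, afterClear)
  }
}
```

Calling `CacheInspectionSketch.run(spark)` on a session with nothing else cached lets you compare the three counts directly.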

If your application has memory issues, have a look at the Garbage Collection Tuning guide. It suggests where to start and what can be changed to improve GC.

Upvotes: 5
