Bonson

Reputation: 1468

Need to release the memory used by unused spark dataframes

I am not caching or persisting the Spark dataframe. If I have to do many additional things in the same session, aggregating and modifying the dataframe's content as part of the process, then when and how will the initial dataframe be released from memory?

Example:

I load a dataframe DF1 with 10 million records. Then I do some transformation on the dataframe, which creates a new dataframe, DF2. Then there is a series of 10 steps I perform on DF2. Throughout all of this, I no longer need DF1. How can I be sure that DF1 no longer exists in memory and is not hampering performance? Is there a way to directly remove DF1 from memory? Or does DF1 get removed automatically based on a Least Recently Used (LRU) policy?
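In code, the scenario is roughly this (the path and column name are made up for illustration):

df1 = spark.read.parquet("/data/source")   # ~10 million records
df2 = df1.groupBy("key").count()           # transformation producing DF2
# ... a series of ~10 further steps on df2 ...
# df1 is never referenced again from this point on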

Upvotes: 4

Views: 2155

Answers (2)

Nathan T Alexander

Reputation: 257

Prompted by a question from A Pantola in the comments, I'm returning here to post a better answer. Note that there are MANY possible correct answers for how to optimize RAM usage, and the right one will depend on the work being done!

First, write the dataframe to DBFS, something like this:

# create a small example dataframe, repartition it, and write it out as
# partitioned parquet ("tmpdir" is a placeholder for your target directory)
spark.createDataFrame(data=[('A', 0)], schema=['LETTERS', 'NUMBERS'])\
    .repartition("LETTERS")\
    .write.partitionBy("LETTERS")\
    .parquet(f"/{tmpdir}", mode="overwrite")

Now,

df = spark.read.parquet(f"/{tmpdir}")  # df is now backed by files on disk, not RAM

Assuming you don't set up any caching on the above df, then each time Spark evaluates an action that references df, it will read the parquet files in parallel and compute whatever is specified.
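For example (a sketch, reusing the LETTERS/NUMBERS dataframe from above), each of the two actions below triggers its own parallel scan of the parquet files, since nothing is cached:

df.count()                                   # full parallel read, then count
df.groupBy("LETTERS").sum("NUMBERS").show()  # another full parallel read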

Note that this approach minimizes RAM usage, but it may require more CPU on every read, and it also carries the one-time cost of writing the data to parquet.

Upvotes: 0

Steven

Reputation: 15318

That's not how Spark works. Dataframes are lazy: the only things stored in memory are the structures (schemas) and the list of transformations you have applied to your dataframes. The data themselves are not stored in memory (unless you cache them and apply an action).
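A minimal sketch to illustrate (the filter is arbitrary): df2 below is only a logical plan that references df1, and nothing is computed or held in memory until an action runs.

df1 = spark.range(10_000_000)   # no rows materialized yet
df2 = df1.filter("id % 2 = 0")  # still just a plan
df2.explain()                   # prints the query plan; no data read
df2.count()                     # only this action triggers computation

Only if you explicitly call df1.cache() (or persist()) and then run an action does df1's data occupy executor memory; in that case, df1.unpersist() releases it.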

Therefore, I do not see any problem in the scenario you describe.

Upvotes: 2
