Manav Karthikeyan

Reputation: 53

Clearing Cached Data on Databricks Cluster

[screenshot: cluster memory usage showing a small amount of used memory and a large amount of cached memory]

The problem I am facing is that my "used" memory is only around 16 GB; however, the cached memory takes up so much space that I am forced to use a compute with higher memory (64 GB).

So I tried to disable the cache using:

spark.conf.set("spark.databricks.io.cache.enabled", "false")

I understand this only disables the IO cache, and while it does lead to some reduction in the cached memory,

[screenshot: memory usage after disabling the IO cache]

a significant amount of the cached memory still remains. Is there any way to completely disable the cache or reduce the amount of cache used?

Upvotes: 0

Views: 104

Answers (1)

JayashankarGS

Reputation: 8140

You have already set spark.conf.set("spark.databricks.io.cache.enabled", "false"); below are further ways you can remove cached data.

Run the code below to remove all cached tables and DataFrames from the in-memory cache.

spark.catalog.clearCache()
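
For example, a minimal sketch (my_table is just a placeholder table name) showing how to confirm the cache is actually empty afterwards:

# Cache a table, confirm it is cached, then clear everything from the in-memory cache.
spark.sql("CACHE TABLE my_table")            # my_table is a placeholder name
print(spark.catalog.isCached("my_table"))    # True

spark.catalog.clearCache()                   # drops all cached tables and DataFrames
print(spark.catalog.isCached("my_table"))    # False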

If you know the exact DataFrame objects that were cached (materialized when actions such as show or count were called), unpersist them:

df.unpersist()

or

# Iterate over every RDD the JVM still has marked as persistent and unpersist it.
for (rdd_id, rdd) in spark.sparkContext._jsc.getPersistentRDDs().items():
    rdd.unpersist()
    print("Unpersisted RDD {}".format(rdd_id))

Next, try to avoid the actions mentioned above (show, count, etc.) where they are not needed, since they are what trigger a cached DataFrame to actually be materialized in memory.

If you have broadcast variables, try to minimize them; they can also consume memory.
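
If that applies, releasing them explicitly looks roughly like this (bc is a hypothetical broadcast variable):

bc = spark.sparkContext.broadcast(list(range(100)))  # hypothetical broadcast variable
# ... use bc.value inside transformations ...
bc.unpersist()   # delete cached copies on the executors
bc.destroy()     # release all data and metadata, including on the driver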

Check the execution plan:

df.explain(True)

If the plan contains InMemoryTableScan, the query is reading from a cached object; unpersist those DataFrames. Also look for unnecessary Exchange, ShuffledHashJoin, or BroadcastHashJoin operators, which can cause implicit caching.
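
One way to check this programmatically is sketched below; note that _jdf and queryExecution are internal PySpark attributes, so treat this as an assumption that may vary across Spark versions.

# Search the physical plan text for InMemoryTableScan (internal API, may change between versions).
plan = df._jdf.queryExecution().executedPlan().toString()
if "InMemoryTableScan" in plan:
    print("df is reading from a cached (in-memory) relation")
    df.unpersist()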

Note: Disabling the caching mechanism completely will hurt the performance of operations that re-evaluate the same DataFrame frequently, so make sure you only remove caches that are unnecessary.

Upvotes: 1
