Arnab

Reputation: 1037

How can I check whether my RDD or dataframe is cached or not?

I have created a DataFrame, say df1, and cached it using df1.cache(). How can I check whether it has actually been cached? Also, is there a way to see all of my cached RDDs or DataFrames?

Upvotes: 23

Views: 21036

Answers (6)

John Haberstroh

Reputation: 505

The only sensible answer I've found is here:

https://stackoverflow.com/a/63037191/1524650

However, this relies on SparkSession.sharedState, which is marked as "Unstable":

https://spark.apache.org/docs/3.4.1/api/scala/org/apache/spark/sql/SparkSession.html#sharedState:org.apache.spark.sql.internal.SharedState

It would seem that there's no good way to do this, then. You can check that Spark has been instructed to try to cache something, but there is no public API to check which objects are currently in cache.
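For completeness, a minimal sketch of what the linked answer does, assuming a SparkSession named spark and using the internal, unstable cacheManager on sharedState (so it may break across Spark versions):

// Internal/unstable API: sharedState.cacheManager is not part of the
// public contract and may change between Spark releases.
val df = spark.range(10).toDF("id")
df.cache()

// lookupCachedData returns Some(...) once the DataFrame's logical plan
// has been registered with the cache manager (i.e. cache() was requested).
val cacheRequested = spark.sharedState.cacheManager.lookupCachedData(df).isDefined
println(cacheRequested) // true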

Upvotes: 0

belgacea

Reputation: 1164

Since Spark 1.4 you can retrieve the storage level of an RDD, and since Spark 2.1 you can do the same for a DataFrame.

val storageLevel = rdd.getStorageLevel
val storageLevel = dataframe.storageLevel

Then you can check whether it is cached anywhere as follows:

val isCached: Boolean = storageLevel.useMemory || storageLevel.useDisk || storageLevel.useOffHeap
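As a usage sketch, those checks can be wrapped into small helpers (the helper names are illustrative, not part of the Spark API):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// A dataset counts as cached if any storage (memory, disk, or off-heap)
// has been requested for it.
def isRddCached(rdd: RDD[_]): Boolean = {
  val sl = rdd.getStorageLevel
  sl.useMemory || sl.useDisk || sl.useOffHeap
}

def isDataFrameCached(df: DataFrame): Boolean = {
  val sl = df.storageLevel
  sl.useMemory || sl.useDisk || sl.useOffHeap
}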

Upvotes: 4

Patrick McGloin

Reputation: 2234

You can check storageLevel.useMemory on a DataFrame and getStorageLevel.useMemory on an RDD to find out whether the dataset is in memory.

For the Dataframe do this:

scala> val df = Seq(1, 2).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]

scala> df.storageLevel.useMemory
res0: Boolean = false

scala> df.cache()
res1: df.type = [value: int]

scala> df.storageLevel.useMemory
res2: Boolean = true

For the RDD do this:

scala> val rdd = sc.parallelize(Seq(1,2))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd.getStorageLevel.useMemory
res9: Boolean = false

scala> rdd.cache()
res10: rdd.type = ParallelCollectionRDD[1] at parallelize at <console>:21

scala> rdd.getStorageLevel.useMemory
res11: Boolean = true

Upvotes: 22

bmc

Reputation: 51

In Java and Scala, the following method can be used to find all the persisted RDDs: sparkContext.getPersistentRDDs() (see the SparkContext API documentation).
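As a sketch, assuming an active SparkContext named sc, listing the persisted RDDs could look like this:

val rdd = sc.parallelize(1 to 10)
rdd.cache()

// getPersistentRDDs returns a Map[Int, RDD[_]] keyed by RDD id,
// containing every RDD that has been marked for persistence.
sc.getPersistentRDDs.foreach { case (id, persisted) =>
  println(s"RDD $id: ${persisted.getStorageLevel.description}")
}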

It looks like this method is not available in Python yet:

https://issues.apache.org/jira/browse/SPARK-2141

But one could use this short-term hack:

sparkContext._jsc.getPersistentRDDs().items()

Upvotes: 4

Sai Kiriti Badam

Reputation: 960

Since Spark 2.1.0 (Scala), this can be checked for a DataFrame as follows:

dataframe.storageLevel.useMemory
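One caveat: useMemory stays false when a DataFrame is persisted to disk only, so a more general check (a sketch assuming a SparkSession named spark) compares the storage level against StorageLevel.NONE:

import org.apache.spark.storage.StorageLevel

val df = spark.range(100).toDF("id")
df.persist(StorageLevel.DISK_ONLY)

println(df.storageLevel.useMemory)            // false, even though df is persisted
println(df.storageLevel != StorageLevel.NONE) // true: a storage level was requested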

Upvotes: 8

user6296218

Reputation: 355

@Arnab,

Did you find the function in Python?
Here is an example for DataFrame DF:

DF.cache()
print(DF.is_cached)

Hope this helps.
Ram

Upvotes: 14
