Reputation: 1037
I have created a DataFrame, say df1, and cached it using df1.cache(). How can I check whether it has actually been cached? Also, is there a way to see all my cached RDDs or DataFrames?
Upvotes: 23
Views: 21036
Reputation: 505
The only sensible answer I've found is here:
https://stackoverflow.com/a/63037191/1524650
However, this relies on SparkSession.sharedState,
which is marked as "Unstable".
It would seem that there's no good way to do this, then. You can check that Spark has been instructed to try to cache something, but there is no public API to check which objects are currently in cache.
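For reference, the approach from the linked answer looks roughly like this (a sketch only; sharedState and its cacheManager are internal, "Unstable" APIs and may change between releases):

// Sketch: checks whether the plan of `df` is registered with the cache manager,
// i.e. cache()/persist() has been called on it (not whether it is materialized).
val df = spark.range(10).toDF("id")
df.cache()
val isMarkedCached: Boolean =
  spark.sharedState.cacheManager.lookupCachedData(df).isDefined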
Upvotes: 0
Reputation: 1164
You can retrieve the storage level of an RDD since Spark 1.4, and of a DataFrame since Spark 2.1.
val storageLevel = rdd.getStorageLevel
val storageLevel = dataframe.storageLevel
Then you can check where it's stored as follows:
val isCached: Boolean = storageLevel.useMemory || storageLevel.useDisk || storageLevel.useOffHeap
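Equivalently, a storage level of StorageLevel.NONE means no caching was requested, so a small helper (the name isMarkedForCaching is just illustrative) could be:

import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// A DataFrame that was never cached/persisted reports StorageLevel.NONE,
// so anything else means cache() or persist() was called on it.
def isMarkedForCaching(df: DataFrame): Boolean =
  df.storageLevel != StorageLevel.NONE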
Upvotes: 4
Reputation: 2234
You can call storageLevel.useMemory on a DataFrame, or getStorageLevel.useMemory on an RDD, to find out whether the dataset is marked for in-memory storage.
For the DataFrame, do this:
scala> val df = Seq(1, 2).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]
scala> df.storageLevel.useMemory
res1: Boolean = false
scala> df.cache()
res0: df.type = [value: int]
scala> df.storageLevel.useMemory
res1: Boolean = true
For the RDD do this:
scala> val rdd = sc.parallelize(Seq(1,2))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21
scala> rdd.getStorageLevel.useMemory
res9: Boolean = false
scala> rdd.cache()
res10: rdd.type = ParallelCollectionRDD[1] at parallelize at <console>:21
scala> rdd.getStorageLevel.useMemory
res11: Boolean = true
Upvotes: 22
Reputation: 51
In Java and Scala, the following method can be used to find all the persisted RDDs:
sparkContext.getPersistentRDDs()
Here is a link to the documentation.
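For example, in the Scala shell you could list every persisted RDD like this (a quick sketch; getPersistentRDDs returns a Map keyed by RDD id):

// Print the id, name and storage level of every RDD marked as persistent.
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"id=$id name=${rdd.name} storageLevel=${rdd.getStorageLevel.description}")
}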
It looks like this method is not available in Python yet:
https://issues.apache.org/jira/browse/SPARK-2141
But one could use this short-term hack:
sparkContext._jsc.getPersistentRDDs().items()
Upvotes: 4
Reputation: 960
Since Spark (Scala) 2.1.0, this can be checked for a DataFrame as follows:
dataframe.storageLevel.useMemory
Upvotes: 8
Reputation: 355
@Arnab,
Did you find the function in Python?
Here is an example for DataFrame DF:
DF.cache()
print DF.is_cached
Hope this helps.
Ram
Upvotes: 14