Reputation: 365
I am planning to cache some data frames/tables in Spark. How can I find out how many data frames/tables are currently cached?
Upvotes: 5
Views: 6650
Reputation: 2736
You can call the underlying Java objects from PySpark:
# getRDDStorageInfo() returns one RDDInfo object per cached RDD the driver
# is tracking; pull the fields of interest into plain Python dicts.
cached_rdds = [{
    "name": s.name(),
    "memSize_MB": float(s.memSize()) / 2**20,
    "memSize_GB": float(s.memSize()) / 2**30,
    "diskSize_MB": float(s.diskSize()) / 2**20,
    "diskSize_GB": float(s.diskSize()) / 2**30,
    "numPartitions": s.numPartitions(),
    "numCachedPartitions": s.numCachedPartitions(),
    "callSite": s.callSite(),
    "externalBlockStoreSize": s.externalBlockStoreSize(),
    "id": s.id(),
    "isCached": s.isCached(),
    "parentIds": s.parentIds(),
    "scope": s.scope(),
    "storageLevel": s.storageLevel(),
    "toString": s.toString()
} for s in sc._jsc.sc().getRDDStorageInfo()]
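For the original question (how many data frames/tables are cached), you can simply count and summarize the entries; cached_rdds is just the name given to the list above.

# One entry per cached RDD/DataFrame/table reported by the driver
print("Cached entries:", len(cached_rdds))

# One-line summary per entry
for info in cached_rdds:
    print(info["name"],
          round(info["memSize_MB"], 1), "MB in memory,",
          info["numCachedPartitions"], "of", info["numPartitions"], "partitions cached")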
See Spark Java Docs for more info.
Modified from zero323's answer: https://stackoverflow.com/a/42003733/5060792
Upvotes: 5
Reputation: 2598
You can follow what Brian said. PySpark doesn't have the sc.getPersistentRDDs method that the Scala API provides.
You can track the issue here.
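In the meantime, one workaround is to reach the Scala method through the JVM gateway. This is only a sketch; it assumes the py4j proxy exposes the Scala map's size() method as shown.

# Call the Scala SparkContext's getPersistentRDDs via py4j;
# the result is a JVM proxy for a Scala Map[Int, RDD[_]]
persistent_rdds = sc._jsc.sc().getPersistentRDDs()

# Count the persisted RDDs/DataFrames
print("Persisted RDDs:", persistent_rdds.size())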
Upvotes: 1
Reputation: 3402
You can also see details of cached RDDs/DataFrames in the Spark UI's Storage tab or via the REST API.
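A minimal sketch against the REST API, assuming a locally running driver on the default UI port 4040 (adjust host, port, and authentication for your deployment):

import requests

base = "http://localhost:4040/api/v1"

# Look up the application id of the running app
app_id = requests.get(base + "/applications").json()[0]["id"]

# /applications/[app-id]/storage/rdd lists every RDD/DataFrame currently cached
cached = requests.get("{0}/applications/{1}/storage/rdd".format(base, app_id)).json()
print("Cached entries:", len(cached))
for rdd in cached:
    print(rdd["name"], rdd["memoryUsed"], "bytes in memory")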
Upvotes: 0