Ramsey

Reputation: 365

How to check the list of cache data frames/rdds/tables in Spark?

I am planning to cache some data frames/tables in Spark. How can I check how many data frames/tables are currently cached?

Upvotes: 5

Views: 6650

Answers (3)

Clay

Reputation: 2736

You can call the underlying Java object in PySpark:

# `sc` is the active SparkContext (e.g. spark.sparkContext).
# getRDDStorageInfo() returns one RDDInfo object per cached RDD.
[{
    "name": s.name(),
    "memSize_MB": float(s.memSize()) / 2**20,
    "memSize_GB": float(s.memSize()) / 2**30,
    "diskSize_MB": float(s.diskSize()) / 2**20,
    "diskSize_GB": float(s.diskSize()) / 2**30,
    "numPartitions": s.numPartitions(),
    "numCachedPartitions": s.numCachedPartitions(),
    "callSite": s.callSite(),
    "externalBlockStoreSize": s.externalBlockStoreSize(),
    "id": s.id(),
    "isCached": s.isCached(),
    "parentIds": s.parentIds(),
    "scope": s.scope(),
    "storageLevel": s.storageLevel(),
    "toString": s.toString()
} for s in sc._jsc.sc().getRDDStorageInfo()]

See Spark Java Docs for more info.

Modified from zero323's answer: https://stackoverflow.com/a/42003733/5060792
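If you need this more than once, the snippet above can be wrapped in a small helper. The function name cached_rdd_info is purely illustrative:

def cached_rdd_info(sc):
    # One dict per cached RDD currently reported by the driver.
    return [{
        "id": s.id(),
        "name": s.name(),
        "memSize_MB": float(s.memSize()) / 2**20,
        "diskSize_MB": float(s.diskSize()) / 2**20,
        "numCachedPartitions": s.numCachedPartitions(),
    } for s in sc._jsc.sc().getRDDStorageInfo()]

for info in cached_rdd_info(sc):
    print(info["name"], info["memSize_MB"], info["numCachedPartitions"])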

Upvotes: 5

Rahul

Reputation: 2598

You can follow what Brian said. PySpark does not expose a 'sc.getPersistentRDDs' method the way the Scala API does.

You can track the issue here
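That said, the underlying JVM method can still be reached through the private _jsc handle. This is only a rough sketch that relies on internals and may break between versions; it assumes sc is the active SparkContext:

# Reach JavaSparkContext.getPersistentRDDs() through py4j.
java_map = sc._jsc.getPersistentRDDs()   # java.util.Map<Integer, JavaRDD<?>>
print("persisted RDDs:", java_map.size())

# Walk the Java map entries and print basic info about each persisted RDD.
it = java_map.entrySet().iterator()
while it.hasNext():
    entry = it.next()
    rdd = entry.getValue()
    print(entry.getKey(), rdd.name(), rdd.getStorageLevel().description())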

Upvotes: 1

Brian Cajes

Reputation: 3402

One can see details of cached RDDs/DataFrames via the Spark UI's Storage tab or via the REST API.
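For the REST API route, something along these lines should work (assuming the application UI is reachable on localhost:4040, the default port, and that the response field names match your Spark version):

import requests

# The monitoring API returns one entry per cached/persisted RDD.
app_id = sc.applicationId
url = "http://localhost:4040/api/v1/applications/{}/storage/rdd".format(app_id)

for rdd in requests.get(url).json():
    print(rdd["id"], rdd["name"], rdd["memoryUsed"], rdd["diskUsed"])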

Upvotes: 0
