Reputation: 365
I am planning to cache some data frames/tables in Spark. How can I find out how many data frames/tables are currently cached?
Upvotes: 5
Views: 6650
Reputation: 2736
You can call the underlying Java objects from PySpark:
# getRDDStorageInfo() returns one RDDInfo object per cached RDD the driver
# is tracking; pull the fields of interest into plain Python dicts.
cached_rdds = [{
    "name": s.name(),
    "memSize_MB": float(s.memSize()) / 2**20,
    "memSize_GB": float(s.memSize()) / 2**30,
    "diskSize_MB": float(s.diskSize()) / 2**20,
    "diskSize_GB": float(s.diskSize()) / 2**30,
    "numPartitions": s.numPartitions(),
    "numCachedPartitions": s.numCachedPartitions(),
    "callSite": s.callSite(),
    "externalBlockStoreSize": s.externalBlockStoreSize(),
    "id": s.id(),
    "isCached": s.isCached(),
    "parentIds": s.parentIds(),
    "scope": s.scope(),
    "storageLevel": s.storageLevel(),
    "toString": s.toString()
} for s in sc._jsc.sc().getRDDStorageInfo()]
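For the original question (how many data frames/tables are cached), you can simply count and summarize the entries; cached_rdds is just the name given to the list above.

# One entry per cached RDD/DataFrame/table reported by the driver
print("Cached entries:", len(cached_rdds))

# One-line summary per entry
for info in cached_rdds:
    print(info["name"],
          round(info["memSize_MB"], 1), "MB in memory,",
          info["numCachedPartitions"], "of", info["numPartitions"], "partitions cached")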
See Spark Java Docs for more info.
Modified from zero323's answer: https://stackoverflow.com/a/42003733/5060792
Upvotes: 5
Reputation: 2598
You can follow what Brian said. PySpark doesn't have the sc.getPersistentRDDs method that the Scala API provides.
You can track the issue here.
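In the meantime, one workaround is to reach the Scala method through the JVM gateway. This is only a sketch; it assumes the py4j proxy exposes the Scala map's size() method as shown.

# Call the Scala SparkContext's getPersistentRDDs via py4j;
# the result is a JVM proxy for a Scala Map[Int, RDD[_]]
persistent_rdds = sc._jsc.sc().getPersistentRDDs()

# Count the persisted RDDs/DataFrames
print("Persisted RDDs:", persistent_rdds.size())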
Upvotes: 1
Reputation: 3402
You can also see details of cached RDDs/DataFrames in the Spark UI's Storage tab or via the REST API.
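A minimal sketch against the REST API, assuming a locally running driver on the default UI port 4040 (adjust host, port, and authentication for your deployment):

import requests

base = "http://localhost:4040/api/v1"

# Look up the application id of the running app
app_id = requests.get(base + "/applications").json()[0]["id"]

# /applications/[app-id]/storage/rdd lists every RDD/DataFrame currently cached
cached = requests.get("{0}/applications/{1}/storage/rdd".format(base, app_id)).json()
print("Cached entries:", len(cached))
for rdd in cached:
    print(rdd["name"], rdd["memoryUsed"], "bytes in memory")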
Upvotes: 0