Reputation: 1193
I am frequently dealing with containers getting killed by YARN for exceeding memory limits. I suspect it has to do with caching/unpersisting RDDs/DataFrames in an inefficient manner.
What is the best way to debug this type of issue?
I have looked at the "Storage" tab in the Spark Web UI, but the "RDD Names" don't get any more descriptive than "MapPartitionsRDD" or "UnionRDD". How do I figure out which specific RDDs take up the most space in the cache?
To track down the out-of-memory errors, I need to know which RDDs are taking up the most space in the cache. I also want to be able to track when they get unpersisted.
Upvotes: 0
Views: 277
Reputation: 26
For RDDs, you can set meaningful names using the setName method:
import org.apache.spark.rdd.RDD
val rdd: RDD[T] = ???
rdd.setName("foo")   // the chosen name appears in the Storage tab instead of "MapPartitionsRDD"
For catalog-backed tables:
import org.apache.spark.sql.DataFrame
val df: DataFrame = ???
df.createOrReplaceTempView("foo")   // register the DataFrame under the name "foo"
spark.catalog.cacheTable("foo")     // cache it by that name
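If you want to verify from code that the table is marked as cached, something like this should work (using the table name "foo" from above):
spark.catalog.isCached("foo")   // should return true after cacheTable
spark.table("foo").count()      // forces materialization of the cached data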
The name in the catalog will be reflected in both the UI and SparkContext.getPersistentRDDs.
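As a rough sketch, you could enumerate everything that is currently persisted and print its name, storage level, and partition count (assuming an active SparkContext sc):
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  // name is whatever was set via setName (or the catalog table name for cached tables)
  println(s"id=$id name=${rdd.name} level=${rdd.getStorageLevel.description} partitions=${rdd.partitions.length}")
}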
I am not aware of any solution which works for standalone Datasets.
Upvotes: 1