Andrimner

Reputation: 48

Is it possible to set the default storage level in Spark?

In Spark it is possible to explicitly set the storage level for RDDs and DataFrames, but is it possible to change the default storage level? If so, how can it be achieved? If not, why is that not possible?

Similar questions have been asked here and there, but the answers only state that the solution is to set the storage level explicitly, without further explanation.

Upvotes: 3

Views: 4290

Answers (2)

Som

Reputation: 6323

I would suggest taking a look at CacheManager.scala#cacheQuery(..). The method definition and its doc comment look like this:

/**
 * Caches the data produced by the logical representation of the given [[Dataset]].
 * Unlike `RDD.cache()`, the default storage level is set to be `MEMORY_AND_DISK` because
 * recomputing the in-memory columnar representation of the underlying table is expensive.
 */
def cacheQuery(
    query: Dataset[_],
    tableName: Option[String] = None,
    storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock {
  ...
}

Here, if you observe, Spark internally does not fetch the default storage level from any configuration; its default value is hardcoded in the source itself.
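
For illustration, a minimal sketch showing that this hardcoded default is what you get when calling cache() without arguments (assuming a SparkSession named spark, e.g. in spark-shell):

import org.apache.spark.storage.StorageLevel

val df = spark.range(100).toDF("id")
df.cache()  // no level passed, so the cache manager falls back to the hardcoded MEMORY_AND_DISK

// Dataset.storageLevel reports the level registered for this plan
assert(df.storageLevel == StorageLevel.MEMORY_AND_DISK)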

Since there is no configuration available to override this default behaviour, the only remaining option is to pass the desired storage level explicitly when persisting the DataFrame/RDD.
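
If you want something that behaves like a project-wide default, one workaround is to centralise the level in your own code and persist through that. A minimal sketch; the object Caching, the value defaultLevel and the method persistWithDefault are hypothetical names, not Spark API:

import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

object Caching {
  // Assumption: MEMORY_ONLY_SER is the default your project prefers
  val defaultLevel: StorageLevel = StorageLevel.MEMORY_ONLY_SER

  // Persist any Dataset/DataFrame with the project-wide level
  def persistWithDefault[T](ds: Dataset[T]): Dataset[T] = ds.persist(defaultLevel)
}

// usage: Caching.persistWithDefault(df) instead of calling df.persist(...) everywhere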

Upvotes: 2

Vahid Shahrivari

Reputation: 138

Please check the following:

[SPARK-3824][SQL] Sets in-memory table default storage level to MEMORY_AND_DISK

Using persist() you can choose among various storage levels for persisted RDDs in Apache Spark. The persistence levels available in Spark 3.0 are listed below:

- MEMORY_ONLY: Data is stored directly as objects and stored only in memory.

- MEMORY_ONLY_SER: Data is serialized as a compact byte array representation and stored only in memory. To use it, it has to be deserialized at a cost.

- MEMORY_AND_DISK: Data is stored directly as objects in memory, but if there’s insufficient memory the rest is serialized and stored on disk.

- DISK_ONLY: Data is serialized and stored on disk.

- OFF_HEAP: Data is stored off-heap.

- MEMORY_AND_DISK_SER: Like MEMORY_AND_DISK, but data is serialized when stored in memory. (Data is always serialized when stored on disk.)

For an RDD the default storage level of the persist API is MEMORY_ONLY, and for a Dataset it is MEMORY_AND_DISK.

For example, you can persist your data like this:

import org.apache.spark.storage.StorageLevel

val persistedRdd = rdd.persist(StorageLevel.OFF_HEAP)
val df2 = df.persist(StorageLevel.MEMORY_ONLY_SER)
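
If you want to confirm which level actually took effect, both APIs expose it (a quick check using the persistedRdd and df2 values from the snippet above):

persistedRdd.getStorageLevel   // RDD API: returns the StorageLevel set above (OFF_HEAP)
df2.storageLevel               // Dataset API: returns MEMORY_ONLY_SER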

For more information you can visit: https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/storage/StorageLevel.html

Upvotes: 0
