Andrimner

Reputation: 48

Is it possible to set the default storage level in Spark?

In Spark it is possible to explicitly set the storage level for RDDs and DataFrames, but is it possible to change the default storage level? If so, how can it be achieved? If not, why is that not possible?

Similar questions have been asked here and there, but the answers only state that the solution is to set the storage level explicitly, without further explanation.

Upvotes: 3

Views: 4290

Answers (2)

Som

Reputation: 6323

I would suggest taking a look at CacheManager.scala#cacheQuery(..). The method definition and its doc comment look like this:

/**
 * Caches the data produced by the logical representation of the given [[Dataset]].
 * Unlike `RDD.cache()`, the default storage level is set to be `MEMORY_AND_DISK` because
 * recomputing the in-memory columnar representation of the underlying table is expensive.
 */
def cacheQuery(
    query: Dataset[_],
    tableName: Option[String] = None,
    storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = writeLock {
  ...
}

Here, if you observe, Spark internally does not fetch the default storage level from any configuration; its default value is hardcoded in the source itself.
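
For illustration, a minimal sketch showing that this hardcoded default is what you get when calling cache() without arguments (assuming a SparkSession named spark, e.g. in spark-shell):

import org.apache.spark.storage.StorageLevel

val df = spark.range(100).toDF("id")
df.cache()  // no level passed, so the cache manager falls back to the hardcoded MEMORY_AND_DISK

// Dataset.storageLevel reports the level registered for this plan
assert(df.storageLevel == StorageLevel.MEMORY_AND_DISK)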

Since there is no configuration available to override this default behaviour, the only remaining option is to pass the desired storage level explicitly when persisting the DataFrame/RDD.
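
If you want something that behaves like a project-wide default, one workaround is to centralise the level in your own code and persist through that. A minimal sketch; the object Caching, the value defaultLevel and the method persistWithDefault are hypothetical names, not Spark API:

import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

object Caching {
  // Assumption: MEMORY_ONLY_SER is the default your project prefers
  val defaultLevel: StorageLevel = StorageLevel.MEMORY_ONLY_SER

  // Persist any Dataset/DataFrame with the project-wide level
  def persistWithDefault[T](ds: Dataset[T]): Dataset[T] = ds.persist(defaultLevel)
}

// usage: Caching.persistWithDefault(df) instead of calling df.persist(...) everywhere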

Upvotes: 2

Vahid Shahrivari

Reputation: 138

Please check the following:

[SPARK-3824][SQL] Sets in-memory table default storage level to MEMORY_AND_DISK

Using persist() you can choose among various storage levels for persisted RDDs in Apache Spark. The persistence levels available in Spark 3.0 are listed below:

- MEMORY_ONLY: Data is stored directly as objects and stored only in memory.

- MEMORY_ONLY_SER: Data is serialized as a compact byte array representation and stored only in memory. To use it, it has to be deserialized at a cost.

- MEMORY_AND_DISK: Data is stored directly as objects in memory, but if there’s insufficient memory the rest is serialized and stored on disk.

- DISK_ONLY: Data is serialized and stored on disk.

- OFF_HEAP: Data is stored off-heap.

- MEMORY_AND_DISK_SER: Like MEMORY_AND_DISK, but data is serialized when stored in memory. (Data is always serialized when stored on disk.)

For an RDD the default storage level of the persist API is MEMORY_ONLY, and for a Dataset it is MEMORY_AND_DISK.

For example, you can persist your data like this:

import org.apache.spark.storage.StorageLevel

val persistedRdd = rdd.persist(StorageLevel.OFF_HEAP)
val df2 = df.persist(StorageLevel.MEMORY_ONLY_SER)
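
If you want to confirm which level actually took effect, both APIs expose it (a quick check using the persistedRdd and df2 values from the snippet above):

persistedRdd.getStorageLevel   // RDD API: returns the StorageLevel set above (OFF_HEAP)
df2.storageLevel               // Dataset API: returns MEMORY_ONLY_SER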

For more information you can visit: https://spark.apache.org/docs/latest/api/java/index.html?org/apache/spark/storage/StorageLevel.html

Upvotes: 0
