In terms of RDD persistence, what are the differences between cache() and persist() in Spark?
Upvotes: 251
Views: 182954
Reputation: 41
Both cache() and persist() are methods for persisting an RDD or DataFrame in memory or on disk. cache() is shorthand for persist() with the default storage level MEMORY_ONLY, while persist() provides flexibility by letting you specify the storage level explicitly.
Spark RDDs and DataFrames are lazily evaluated, and sometimes we wish to use the same RDD or DataFrame multiple times. If we do this naively, Spark will recompute the RDD and all of its dependencies each time we call an action on it. This is expensive if you use your RDD or DataFrame iteratively. To avoid computing an RDD multiple times, we can ask Spark to persist the data; when we do, the nodes that compute the RDD or DataFrame store their partitions.
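As a minimal sketch (assuming a Spark shell where sc is available, and a hypothetical input.txt), persisting an RDD before reusing it avoids recomputing the whole lineage on every action:

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical input file; the lineage below (read + split) is what
// Spark would otherwise recompute on every action.
val words = sc.textFile("input.txt").flatMap(_.split(" "))

words.persist(StorageLevel.MEMORY_ONLY)

words.count()            // first action: computes the partitions and caches them
words.distinct().count() // reuses the cached partitions instead of re-reading the file
```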
When you use persist() instead of cache() (which defaults to MEMORY_ONLY), there are different storage levels to choose from:
| Level | Space Used | CPU Time | In Memory | On Disk | Comments |
|---|---|---|---|---|---|
| MEMORY_ONLY | High | Low | Y | N | |
| MEMORY_ONLY_SER | Low | High | Y | N | Data serialized to save space |
| MEMORY_AND_DISK | High | Medium | Some | Some | Spills to disk if data can't fit in worker-node memory |
| MEMORY_AND_DISK_SER | Low | High | Some | Some | Same as MEMORY_AND_DISK, but data written in serialized format to save space |
| DISK_ONLY | Low | High | N | Y | |
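For example (a sketch, assuming a Spark shell with sc available and a hypothetical input file), choosing MEMORY_AND_DISK from the table above for data that may not fit in memory:

```scala
import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK keeps the partitions that fit in memory and spills the
// rest to disk, instead of dropping them and recomputing as MEMORY_ONLY would.
val big = sc.textFile("large-input.txt")
big.persist(StorageLevel.MEMORY_AND_DISK)
```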
Upvotes: 0
Reputation: 29237
The difference between the cache and persist operations is purely syntactic. cache is a synonym of persist(MEMORY_ONLY), i.e. cache is merely persist with the default storage level MEMORY_ONLY.
But with persist(), we can save the intermediate results in 5 storage levels:
- MEMORY_ONLY
- MEMORY_AND_DISK
- MEMORY_ONLY_SER
- MEMORY_AND_DISK_SER
- DISK_ONLY
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
Caching and persistence are optimization techniques for (iterative and interactive) Spark computations. They help save interim partial results so they can be reused in subsequent stages. These interim results, as RDDs, are kept in memory (by default) or in more solid storage like disk, and/or replicated.
RDDs can be cached using the cache operation. They can also be persisted using the persist operation.
persist, cache
These functions can be used to adjust the storage level of an RDD. When freeing up memory, Spark will use the storage level identifier to decide which partitions should be kept. The parameterless variants persist() and cache() are just abbreviations for persist(StorageLevel.MEMORY_ONLY).
Warning: once the storage level has been assigned, it cannot be changed again (without first calling unpersist)!
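A small sketch of this warning (Spark shell, sc assumed): re-persisting with a different level throws unless the RDD is unpersisted first:

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100)
rdd.persist(StorageLevel.MEMORY_ONLY)

// rdd.persist(StorageLevel.DISK_ONLY) // throws UnsupportedOperationException:
//                                     // an RDD's level cannot be changed once set

rdd.unpersist()                     // release the old level first
rdd.persist(StorageLevel.DISK_ONLY) // now the new level is accepted
```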
Just because you can cache an RDD in memory doesn't mean you should blindly do so. Depending on how many times the dataset is accessed and the amount of work involved in computing it, recomputation can be faster than the price paid by the increased memory pressure.
It should go without saying that if you only read a dataset once there is no point in caching it; it will actually make your job slower. The size of cached datasets can be seen from the Spark shell.
Listing Variants...
def cache(): RDD[T]
def persist(): RDD[T]
def persist(newLevel: StorageLevel): RDD[T]
See the example below:
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.getStorageLevel
res0: org.apache.spark.storage.StorageLevel = StorageLevel(false, false, false, false, 1)
c.cache
c.getStorageLevel
res2: org.apache.spark.storage.StorageLevel = StorageLevel(false, true, false, true, 1)
Note: due to the very small and purely syntactic difference between caching and persistence of RDDs, the two terms are often used interchangeably.
Persist in memory and disk:
Caching can improve the performance of your application to a great extent.
In general, it is recommended to use persist with a specific storage level to have more control over caching behavior, while cache can be used as a quick and convenient way to cache data in memory.
Upvotes: 100
Reputation: 1951
For the impatient:
Without arguments, persist() and cache() are the same, with default settings:
- RDD: MEMORY_ONLY
- Dataset: MEMORY_AND_DISK
Unlike cache(), persist() allows you to pass an argument inside the brackets, in order to specify the storage level:
persist(MEMORY_ONLY)
persist(MEMORY_ONLY_SER)
persist(MEMORY_AND_DISK)
persist(MEMORY_AND_DISK_SER)
persist(DISK_ONLY)
Voilà!
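The differing defaults can be checked directly (a sketch, assuming a Spark shell where both sc and spark are available):

```scala
val rdd = sc.parallelize(Seq(1, 2, 3))
rdd.cache()
rdd.getStorageLevel // MEMORY_ONLY for an RDD

val ds = spark.range(3)
ds.cache()
ds.storageLevel     // MEMORY_AND_DISK for a Dataset
```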
Upvotes: 5
Reputation: 2776
With cache(), you use only the default storage level:
- MEMORY_ONLY for RDD
- MEMORY_AND_DISK for Dataset
With persist(), you can specify which storage level you want, for both RDD and Dataset.
From the official docs:
- You can mark an RDD to be persisted using the persist() or cache() methods on it.
- Each persisted RDD can be stored using a different storage level.
- The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory).
Use persist() if you want to assign a storage level other than:
- MEMORY_ONLY for RDD
- MEMORY_AND_DISK for Dataset
Interesting link for the official documentation: which storage level to choose
Upvotes: 262
Reputation: 61
cache() and persist() are both used to improve the performance of Spark computations. These methods help save intermediate results so they can be reused in subsequent stages.
The only difference between cache() and persist() is that with cache() we can save intermediate results in memory only, while with persist() we can save the intermediate results in any of 5 storage levels (MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY).
Upvotes: 6
Reputation: 2674
Spark gives 5 types of storage level:
MEMORY_ONLY
MEMORY_ONLY_SER
MEMORY_AND_DISK
MEMORY_AND_DISK_SER
DISK_ONLY
cache() will use MEMORY_ONLY. If you want to use something else, use persist(StorageLevel.<type>).
By default, persist() will store the data in the JVM heap as unserialized objects.
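As a sketch (Spark shell, sc assumed), the serialized variants trade CPU time for space against that default:

```scala
import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 1000000)

// Default: deserialized Java objects on the JVM heap — fast access, more space.
nums.persist(StorageLevel.MEMORY_ONLY)

// Serialized alternative (one byte array per partition) — less space, more CPU;
// remember to unpersist() before assigning a different level.
// nums.persist(StorageLevel.MEMORY_ONLY_SER)
```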
Upvotes: 25
Reputation: 10941
There is no difference. From RDD.scala:
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)
/** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
def cache(): this.type = persist()
Upvotes: 51