dinesh028

Reputation: 2187

Why is persist() lazily evaluated in Spark

I understand that in Spark there are two types of operations:

  1. Transformations
  2. Actions

Transformations like map() and filter() are evaluated lazily, so that optimization can be done when an action is executed. For example, if I execute the action first(), then Spark will optimize to read only the first line.
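
A minimal sketch of this (assuming a SparkContext named sc and a hypothetical input path):

    // Transformations only build a lineage; no job runs here
    val lines  = sc.textFile("hdfs:///logs/app.log")  // hypothetical path
    val errors = lines.filter(_.contains("ERROR"))    // lazy, nothing computed yet

    // first() is an action: Spark runs only as much of the plan as is
    // needed to return one element, instead of scanning the whole file
    val firstError = errors.first()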

But why is the persist() operation evaluated lazily? Because either way, eager or lazy, it will persist the entire RDD as per the storage level.

Can you please explain why persist() is a transformation instead of an action?

Upvotes: 15

Views: 11973

Answers (3)

Rüdiger Klaehn

Reputation: 12565

If you have some data that you might or might not use, making persist() eager would be inefficient. A normal Spark transformation corresponds to a def in Scala; persist() turns it into a lazy val.
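
A rough sketch of that analogy in plain Scala (hypothetical values, no Spark involved):

    // A def is re-evaluated on every use, like an unpersisted transformation
    def doubled: Seq[Int] = { println("computing"); (1 to 3).map(_ * 2) }

    // A lazy val is computed once, on first use, and never if unused,
    // which is the behavior persist() requests for an RDD
    lazy val doubledOnce: Seq[Int] = { println("computing once"); (1 to 3).map(_ * 2) }

    doubled      // prints "computing"
    doubled      // prints "computing" again
    doubledOnce  // prints "computing once"
    doubledOnce  // silent: the stored value is reused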

Upvotes: 6

zero323

Reputation: 330353

For starters, eager persistence would pollute the whole pipeline. cache or persist only expresses intent; it doesn't mean we'll ever get to the point where the RDD is materialized and can actually be cached. Moreover, there are contexts where data is cached automatically.
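
A minimal sketch of that intent (assuming a SparkContext named sc):

    // persist() returns immediately; it only marks the RDD for caching
    val marked = sc.parallelize(1 to 1000000)
      .map(_ * 2)
      .persist()           // no job runs, nothing is materialized yet

    // If no action ever runs, the data is never computed, let alone cached.
    // Only the first action materializes and caches the partitions:
    marked.count()         // computes the RDD and fills the cache
    marked.reduce(_ + _)   // reuses the cached partitions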

Because either way, eager or lazy, it will persist the entire RDD as per the storage level.

That is not exactly true. The thing is, persist is not persistent. As is clearly stated in the documentation for the MEMORY_ONLY persistence level:

If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed.

With MEMORY_AND_DISK, the remaining data is stored on disk, but it can still be evicted if there is not enough memory for subsequent caching. What is even more important:

Spark automatically monitors cache usage on each node and drops out old data partitions in a least-recently-used (LRU) fashion.
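
For example (a sketch; the input path is hypothetical):

    import org.apache.spark.storage.StorageLevel

    val big = sc.textFile("hdfs:///data/huge.txt")

    // MEMORY_ONLY: partitions that don't fit are dropped and recomputed
    // from the lineage whenever they are needed again
    big.persist(StorageLevel.MEMORY_ONLY)
    big.count()

    // MEMORY_AND_DISK: overflow spills to disk instead, but cached blocks
    // can still be evicted (LRU) under later memory pressure
    big.unpersist()
    big.persist(StorageLevel.MEMORY_AND_DISK)
    big.count()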

You can also argue that cache / persist is semantically different from Spark actions, which are executed for their specific IO side effects. cache is more of a hint to the Spark engine that we may want to reuse this piece of the computation later.

Upvotes: 21

Stéphane

Reputation: 877

persist is a confusing name, as the data does not persist beyond your context's lifetime.

persist (with the default storage level) is the same as cache. Data is cached the first time it is computed, so that if you use your RDD in another computation, the results are not recalculated.
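
For example (a sketch with a hypothetical input path):

    val parsed = sc.textFile("hdfs:///data/events.txt")
      .map(_.split(","))
      .cache()                    // for RDDs, equivalent to persist(StorageLevel.MEMORY_ONLY)

    parsed.count()                // first action: computes and caches the partitions
    parsed.map(_.length).max()    // reuses the cache; the file is not re-read or re-parsed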

Upvotes: 1
