Ged

Reputation: 18108

Spark RDD performance pipelining

If we have, say:

val rdd1 = rdd0.map( ... )

followed by

val rdd2 = rdd1.filter( ... )

Then, when the job actually runs due to an action, can rdd2 start consuming the rdd1 results that have already been computed, or must it wait until all of rdd1's work is complete? It is not apparent to me from reading the Spark documentation. Informatica pipelining does do this, so I assume Spark probably does as well.
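
For reference, a minimal concrete version of the scenario (the data, the functions, and the count() action are placeholders of mine), runnable in spark-shell where sc is predefined:

// spark-shell session; sc (SparkContext) is predefined. Data and functions are made up.
val rdd0 = sc.parallelize(1 to 10)
val rdd1 = rdd0.map(_ * 2)      // transformation: nothing executes yet
val rdd2 = rdd1.filter(_ > 10)  // transformation: nothing executes yet
rdd2.count()                    // action: only now is any work actually done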

Upvotes: 1

Views: 165

Answers (1)

user9924728

Reputation:

  • Spark transformations are lazy, so neither call does anything beyond building the dependency DAG. Your code doesn't even touch the data.

    For anything to be computed, you have to execute an action on rdd2 or one of its descendants.

  • By default they are also forgetful, so unless you cache rdd1, it will be evaluated all over again every time rdd2 is evaluated.

  • Finally, due to lazy evaluation, multiple narrow transformations are combined into a single stage, so your code will interleave calls to the map and filter functions (see the sketch below).
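
One way to observe the interleaving is to add println calls inside the functions. A sketch for spark-shell (the data and the single partition are my own choices, made to keep the printed order readable):

// One partition so the printed order is easy to follow; sc is predefined.
val rdd0 = sc.parallelize(1 to 4, numSlices = 1)

val rdd1 = rdd0.map { x =>
  println(s"map($x)")
  x * 2
}

val rdd2 = rdd1.filter { x =>
  println(s"filter($x)")
  x > 4
}

rdd2.collect()
// Prints map(1), filter(2), map(2), filter(4), ... : map and filter
// alternate per element within a single stage; map does not finish over
// the whole dataset before filter starts.

Note that a second rdd2.collect() would re-run the map from scratch; calling rdd1.cache() before the first action keeps rdd1's results in memory instead.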

Upvotes: 3
