Reputation: 18108
If we have, say:
val rdd1 = rdd0.map( ... )
followed by
val rdd2 = rdd1.filter( ... )
Then, when an action actually triggers execution, can rdd2 start computing on rdd1 results as they become available, or must it wait until all of rdd1's work is complete? This is not apparent to me from reading the Spark documentation. Informatica pipelining does this, so I assume Spark probably does as well.
Upvotes: 1
Views: 165
Spark transformations are lazy, so neither call does anything beyond building the dependency DAG; your code doesn't even touch the data. For anything to be computed, you have to execute an action on rdd2 or one of its descendants.
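A minimal sketch of this, assuming the standard spark-shell SparkContext sc and made-up sample data (the concrete rdd0 here is hypothetical, not from your code):

// Nothing runs here: each line only extends the dependency DAG.
val rdd0 = sc.parallelize(1 to 10)
val rdd1 = rdd0.map(_ * 2)
val rdd2 = rdd1.filter(_ > 5)

// Only an action triggers computation of the whole lineage.
val result = rdd2.collect()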
By default RDDs are also forgetful, so unless you cache rdd1, it will be evaluated all over again every time rdd2 is evaluated.
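Continuing the sketch above, caching would look like this (note that cache() is itself lazy; the data is only materialized by the first action):

rdd1.cache()  // mark rdd1 for in-memory persistence

rdd1.count()  // first action: computes rdd1 and populates the cache
rdd2.count()  // reuses the cached rdd1 instead of recomputing it from rdd0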
Finally, thanks to lazy evaluation, multiple narrow transformations like these are combined into a single stage, so Spark will interleave calls to your map and filter functions rather than finishing all the map work first.
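You can make that interleaving visible with side-effecting prints; in local mode (e.g. spark-shell) they appear on the driver console. A sketch, using a single partition so the ordering is easy to read:

val pipelined = sc.parallelize(1 to 3, numSlices = 1)
  .map { x => println(s"map $x"); x * 2 }
  .filter { x => println(s"filter $x"); x > 2 }

pipelined.collect()
// Expected pattern: map and filter alternate per element,
// e.g. "map 1", "filter 2", "map 2", "filter 4", "map 3", "filter 6",
// rather than all map calls completing before any filter call runs.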
Upvotes: 3