Reputation: 426
In Spark, the following code
rdd = sc.textFile("file path")
rdd1 = rdd.filter(filter1).mapToPair();
rdd2 = rdd.filter(filter2).mapToPair();
rdd3 = rdd1.join(rdd2);
rdd3.saveAsTextFile("output path");
will generate 3 stages. In the Spark Web UI, I saw that stage 1 and stage 2 run in parallel, and the join stage (stage 3) is triggered after the first two finish. My question is: do stage 1 and stage 2 both read the same file at the same time? Does that mean Spark reads the same file twice?
Upvotes: 2
Views: 2109
Reputation: 67075
TL;DR: yes, it will read it twice.
The longer answer is that if the initial read is already in memory (Spark cache or OS page cache), Spark will use that instead of reading from disk again. Without digging into the implementation, your specific scenario would most likely result in two simultaneous reads. That said, this is exactly the reason DataFrames were created. To the scheduler your RDD code is a black box, so beyond the partially shared lineage, each overall stage (read and map*) is distinct as far as the scheduler is concerned. And, as already mentioned, Spark will reuse any part of the lineage that is already cached, when it can.
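For example, here is a minimal sketch of forcing that reuse yourself by caching the source RDD (the filter and pair functions, and the pair types, are hypothetical stand-ins for yours):
JavaRDD<String> rdd = sc.textFile("file path").cache(); // materialized on first use
JavaPairRDD<String, String> rdd1 = rdd.filter(filter1).mapToPair(pairFunc1);
JavaPairRDD<String, String> rdd2 = rdd.filter(filter2).mapToPair(pairFunc2);
// once the cache is populated, later passes read cached partitions, not the file
rdd1.join(rdd2).saveAsTextFile("output path");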
If you want the work to be shared, use DataFrames and their intimate knowledge of the full lineage, including where merging operations pays off. For example, if you take your code and push it through SQL, you will see the merging you are looking for.
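A rough sketch of what that looks like with the DataFrame API (assuming a SparkSession named spark; the filter expressions and the join on the shared "value" column are hypothetical stand-ins for your functions):
// import static org.apache.spark.sql.functions.col;
Dataset<Row> df = spark.read().text("file path");           // one "value" column per line
Dataset<Row> df1 = df.filter(col("value").startsWith("a")); // stands in for filter1
Dataset<Row> df2 = df.filter(col("value").endsWith("z"));   // stands in for filter2
Dataset<Row> joined = df1.join(df2, "value");
joined.explain();                                           // inspect the plan to see how the scan is shared
joined.write().text("output path");
Because the optimizer sees the whole plan rather than opaque functions, it can decide where one scan can serve both branches.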
*I assume you meant map instead of filter, as join wouldn't work otherwise.
Upvotes: 2