Reputation: 426
In Spark, the following code
rdd = sc.textFile("file path")
rdd1 = rdd.filter(filter1).mapToPair();
rdd2 = rdd.filter(filter2).mapToPair();
rdd3 = rdd1.join(rdd2);
rdd3.saveAsTextFile("output path");
will generate 3 stages. In the Spark Web UI, I saw that stage 1 and stage 2 run in parallel, and the join stage (stage 3) is triggered after the first two finish. My question is: do stage 1 and stage 2 both read the same file at the same time? Does that mean Spark reads the same file twice?
Upvotes: 2
Views: 2109
Reputation: 67075
TL;DR: yes, it will read it twice.
The longer answer is that if the initial read is already in memory (Spark cache or OS page cache), Spark will use that instead of reading from disk again. Without digging into the implementation, your specific scenario would most likely result in two simultaneous reads. That said, this is exactly the reason DataFrames were created. To the scheduler your RDD code is a black box, so beyond the partially shared lineage, each overall stage (read and map*) is distinct as far as the scheduler is concerned. And, as already mentioned, Spark will reuse any part of the lineage that is already cached, when it can.
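For example, here is a minimal sketch of forcing that reuse yourself by caching the source RDD (the filter and pair functions, and the pair types, are hypothetical stand-ins for yours):
JavaRDD<String> rdd = sc.textFile("file path").cache(); // materialized on first use
JavaPairRDD<String, String> rdd1 = rdd.filter(filter1).mapToPair(pairFunc1);
JavaPairRDD<String, String> rdd2 = rdd.filter(filter2).mapToPair(pairFunc2);
// once the cache is populated, later passes read cached partitions, not the file
rdd1.join(rdd2).saveAsTextFile("output path");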
If you want the work to be shared, use DataFrames and their intimate knowledge of the full lineage, including where merging operations pays off. For example, if you take your code and push it through SQL, you will see the merging you are looking for.
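A rough sketch of what that looks like with the DataFrame API (assuming a SparkSession named spark; the filter expressions and the join on the shared "value" column are hypothetical stand-ins for your functions):
// import static org.apache.spark.sql.functions.col;
Dataset<Row> df = spark.read().text("file path");           // one "value" column per line
Dataset<Row> df1 = df.filter(col("value").startsWith("a")); // stands in for filter1
Dataset<Row> df2 = df.filter(col("value").endsWith("z"));   // stands in for filter2
Dataset<Row> joined = df1.join(df2, "value");
joined.explain();                                           // inspect the plan to see how the scan is shared
joined.write().text("output path");
Because the optimizer sees the whole plan rather than opaque functions, it can decide where one scan can serve both branches.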
*I assume you meant map instead of filter, as join wouldn't work otherwise.
Upvotes: 2