Reputation: 312
As the driver program executes "sc.textFile", why does the file need to be present on every node? And if we copy it to every node, how does Spark handle execution on the duplicated data?
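For context, a minimal sketch of the call being asked about (the path is hypothetical, and sc is the SparkContext predefined in spark-shell):

    // file:// forces the local file system, so the same path
    // must exist on every worker node that runs a task.
    val lines = sc.textFile("file:///data/a.file")
    println(lines.count())  // triggers the actual distributed read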
Upvotes: 1
Views: 852
Reputation: 1235
From Spark's perspective there are no duplicates.
The driver decides how many partitions you need and plans how the file splits into byte ranges accordingly. On the driver you get to know there are partitions like (see the sketch after this list):
a.file - 0 to 1000
a.file - 1001 to 2000
a.file - 2001 to 3000
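For illustration, a hedged sketch of how you could observe that split from the driver (the path and the minimum-partition count of 3 are assumptions):

    // Ask for at least 3 partitions; Spark plans the byte ranges up front.
    val rdd = sc.textFile("file:///data/a.file", 3)
    // The driver knows the partition layout before any data is read.
    println(rdd.getNumPartitions)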
Later, each executor gets the path to the file and a specific chunk (byte range) to read. Executors don't know whether or not you use a shared file system; the only thing that matters is having the path to the file and knowing where in it to read. It might happen that you end up with just one executor, but it all works the same way: that one executor receives the file location and a chunk to read, one chunk after another, until the whole file is processed.
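If it helps, a minimal sketch that tags each record with the partition it came from, using the standard mapPartitionsWithIndex operation (rdd is the RDD from the sketch above):

    // Each task opens the same path but reads only its own byte range.
    val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
      iter.map(line => s"partition $idx: $line")
    }
    tagged.take(5).foreach(println)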
It works exactly the same way with HDFS (assuming a replication factor of 1), but with HDFS there really is just one directory with just one file, sitting on a specific machine, and all of the executors read from that path. When the replication factor is greater than 1, from Spark's perspective it is still just one path, but the read requests go to different nodes: wherever copies of the file's blocks are.
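For comparison, a hedged sketch of the same read against HDFS (the namenode host, port, and path are hypothetical):

    // One logical path; HDFS serves the block reads from whichever
    // datanodes happen to hold replicas of each block.
    val hdfsLines = sc.textFile("hdfs://namenode:8020/data/a.file")
    println(hdfsLines.getNumPartitions)  // typically one partition per HDFS block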
Upvotes: 1
Reputation: 1992
Use the HDFS filesystem instead of the local file system; HDFS is accessible from every Spark node.
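As a sketch (the namenode address and paths are hypothetical): copy the file into HDFS once, then point Spark at the hdfs:// URI instead of a local path.

    // Run once from a shell: hadoop fs -put /local/a.file /data/a.file
    val lines = sc.textFile("hdfs://namenode:8020/data/a.file")
    println(lines.count())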
Upvotes: 0