Reputation: 312
As the driver program executes "sc.textFile", why does the file need to be present on every node? And if we copy it to every node, how does Spark handle execution on the duplicated data?
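For context, a minimal sketch of the call being asked about (the path is hypothetical, and sc is the SparkContext predefined in spark-shell):

    // file:// forces the local file system, so the same path
    // must exist on every worker node that runs a task.
    val lines = sc.textFile("file:///data/a.file")
    println(lines.count())  // triggers the actual distributed read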
Upvotes: 1
Views: 852
Reputation: 1235
From Spark's perspective there are no duplicates.
The driver decides how many partitions you need and plans how the file splits into byte ranges accordingly. On the driver you get to know there are partitions like (see the sketch after this list):
a.file - 0 to 1000
a.file - 1001 to 2000
a.file - 2001 to 3000
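For illustration, a hedged sketch of how you could observe that split from the driver (the path and the minimum-partition count of 3 are assumptions):

    // Ask for at least 3 partitions; Spark plans the byte ranges up front.
    val rdd = sc.textFile("file:///data/a.file", 3)
    // The driver knows the partition layout before any data is read.
    println(rdd.getNumPartitions)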
Later, each executor gets the path to the file and a specific chunk (byte range) to read. Executors don't know whether or not you use a shared file system; the only thing that matters is having the path to the file and knowing where in it to read. It might happen that you end up with just one executor, but it all works the same way: that one executor receives the file location and a chunk to read, one chunk after another, until the whole file is processed.
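If it helps, a minimal sketch that tags each record with the partition it came from, using the standard mapPartitionsWithIndex operation (rdd is the RDD from the sketch above):

    // Each task opens the same path but reads only its own byte range.
    val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
      iter.map(line => s"partition $idx: $line")
    }
    tagged.take(5).foreach(println)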
It works exactly the same way with HDFS (assuming a replication factor of 1), but with HDFS there really is just one directory with just one file, sitting on a specific machine, and all of the executors read from that path. When the replication factor is greater than 1, from Spark's perspective it is still just one path, but the read requests go to different nodes: wherever copies of the file's blocks are.
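For comparison, a hedged sketch of the same read against HDFS (the namenode host, port, and path are hypothetical):

    // One logical path; HDFS serves the block reads from whichever
    // datanodes happen to hold replicas of each block.
    val hdfsLines = sc.textFile("hdfs://namenode:8020/data/a.file")
    println(hdfsLines.getNumPartitions)  // typically one partition per HDFS block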
Upvotes: 1
Reputation: 1992
Use the HDFS filesystem instead of the local file system; HDFS is accessible from every Spark node.
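As a sketch (the namenode address and paths are hypothetical): copy the file into HDFS once, then point Spark at the hdfs:// URI instead of a local path.

    // Run once from a shell: hadoop fs -put /local/a.file /data/a.file
    val lines = sc.textFile("hdfs://namenode:8020/data/a.file")
    println(lines.count())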
Upvotes: 0