Reputation: 1041
I am trying to play a little bit with Apache Spark in cluster mode. My cluster consists of a driver on my machine, and a worker and the cluster manager (master) on a separate host machine.
I send a text file using sparkContext.addFile(filepath)
, where filepath is the path to the text file on my local machine, and I get the following output:
INFO Utils: Copying /home/files/data.txt to /tmp/spark-b2e2bb22-487b-412b-831d-19d7aa96f275/userFiles-147c9552-1a77-427e-9b17-cb0845807860/data.txt
INFO SparkContext: Added file /home/files/data.txt at http://192.XX.XX.164:58143/files/data.txt with timestamp 1457432207649
But when I try to access the same file using SparkFiles.get("data.txt")
, I get the path to the file on my driver instead of on the worker.
I am setting up my context and adding the file like this:
SparkConf conf = new SparkConf().setAppName("spark-play").setMaster("spark://192.XX.XX.172:7077");
conf.setJars(new String[]{"jars/SparkWorker.jar"});
JavaSparkContext sparkContext = new JavaSparkContext(conf);
sparkContext.addFile("/home/files/data.txt");
List<String> file = sparkContext.textFile(SparkFiles.get("data.txt")).collect();
I am getting a FileNotFoundException here.
Upvotes: 2
Views: 4279
Reputation: 21
I recently faced the same issue, and hopefully my solution can help other people solve it.
We know that when you use SparkContext.addFile(<file_path>)
, Spark copies the file into automatically created working directories on the driver node (in this case, your machine) as well as on the worker nodes of the cluster.
The block of code you shared, where you call SparkFiles.get("data.txt")
, is executed on the driver, so it returns the path to the file on the driver, not on the worker. But the task runs on the worker, and the driver's path to the file does not match the worker's path, because the driver and worker nodes have different working directories. Hence, you get the FileNotFoundException
.
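For illustration, here is a minimal sketch (assuming the Spark 2.x Java API, where flatMap takes a function returning an Iterator) of the other way around: calling SparkFiles.get inside a task makes the lookup run on an executor, so it returns that worker's local copy of the shipped file. It reuses the sparkContext from the question, after the addFile call; the single-element RDD is just a vehicle to run the lookup on a worker.
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.List;
import org.apache.spark.SparkFiles;
// Resolve the path inside a task: SparkFiles.get then runs on the
// executor and returns that node's local copy of the shipped file.
List<String> lines = sparkContext.parallelize(Collections.singletonList(0))
    .flatMap(ignored -> Files.readAllLines(
        Paths.get(SparkFiles.get("data.txt"))).iterator())
    .collect();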
There is a workaround to this problem that needs no distributed file system or FTP server. Put the file in your working directory on the host machine. Then, instead of using SparkFiles.get("data.txt")
, use the relative path "./data.txt"
:
List<String> file = sparkContext.textFile("./data.txt").collect();
Now, even though the working directory paths differ between the Spark driver and worker nodes, you will NOT face a FileNotFoundException
, since you are using a relative path to access the file.
Upvotes: 2
Reputation: 3055
I think the main issue is that you are trying to read the file via the textFile
method. The argument you pass to textFile
is evaluated in the driver program; on the worker nodes, only the code run against the RDD is executed. When you call textFile
, your driver program merely creates an RDD object with a trivial associated DAG, but nothing happens on the worker nodes yet.
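To make the laziness concrete, a minimal sketch (reusing the sparkContext from the question; the path is hypothetical):
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
// Nothing is read here: textFile only records the path in the RDD lineage.
JavaRDD<String> rdd = sparkContext.textFile("/path/visible/only/on/the/driver.txt");
// The failure surfaces only now, when collect() schedules tasks on the
// workers and each task tries to open that path on its own filesystem.
List<String> lines = rdd.collect();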
Thus, when you try to collect the data, each worker is asked to read the file at the path you passed to textFile
, which the driver tells it. Since your file is on the local filesystem of the driver and the worker node has no access to it, you get the FileNotFoundException
.
The solution is to make the file available to the worker node: put it into a distributed filesystem such as HDFS, serve it via (S)FTP, or transfer the file to the worker node before running the Spark job; in that last case, you pass textFile
the path of the file on the worker's filesystem.
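For example, a minimal sketch using HDFS (the namenode host, port, and paths are hypothetical):
import java.util.List;
// Upload the file once, outside Spark, e.g.:
//   hdfs dfs -put /home/files/data.txt /data/data.txt
List<String> file = sparkContext
    .textFile("hdfs://namenode:9000/data/data.txt")
    .collect();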
Upvotes: 0