Reputation: 2928
I am running a simple Spark program on a cluster:
import org.apache.spark.{SparkConf, SparkContext}

val logFile = "/home/hduser/README.md" // Should be some file on your system
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println()
println()
println()
println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
println()
println()
println()
println()
println()
and I get the following error:
15/10/27 19:44:01 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 6) on
executor 192.168.0.19: java.io.FileNotFoundException (File
file:/home/hduser/README.md does not exist.) [duplicate 6]
15/10/27 19:44:01 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times;
aborting job
15/10/27 19:44:01 INFO TaskSetManager: Lost task 1.3 in stage 0.0 (TID 7)
on executor 192.168.0.19: java.io.FileNotFoundException (File
file:/home/hduser/README.md does not exist.) [duplicate 7]
15/10/27 19:44:01 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool
15/10/27 19:44:01 INFO TaskSchedulerImpl: Cancelling stage 0
15/10/27 19:44:01 INFO DAGScheduler: ResultStage 0 (count at
SimpleApp.scala:55) failed in 7.636 s
15/10/27 19:44:01 INFO DAGScheduler: Job 0 failed: count at
SimpleApp.scala:55, took 7.810387 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 192.168.0.19): java.io.FileNotFoundException: File file:/home/hduser/README.md does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:397)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:125)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
at org.apache.hadoop.mapred.LineRecordReader.<init>(LineRecordReader.java:78)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:51)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
The file is in the correct place. If I replace README.md with README.txt, it works fine. Can someone help with this?
Thanks
Upvotes: 0
Views: 4611
Reputation: 848
It is simply because a file with the .md extension contains plain text along with formatting information. When you save the file with a .txt extension, the formatting information is removed or not considered. sc.textFile() works with plain text.
Upvotes: 1
Reputation: 1379
If you are running a multi-node cluster, make sure all nodes have the file at the same path on their own filesystem. Or, you know, just use HDFS.
In the multi-node case, the path "/home/hduser/README.md" is shipped to the worker nodes as well, but README.md probably exists only on the master node. When the workers try to access the file, they do not look at the master's filesystem; each one looks for it on its own filesystem. If the same file exists at the same path on every node, the code is very likely to work. To achieve that, copy the file to every node's filesystem at the same path.
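For illustration, here is a minimal sketch of the local-file approach, assuming README.md has been copied to /home/hduser/README.md on every node (path and file name taken from the question):

// The file:// scheme makes the local-path assumption explicit: every
// executor must be able to find the file at this exact path on its own machine.
val logData = sc.textFile("file:///home/hduser/README.md").cache()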
As you've already noticed, the solution above is cumbersome. The Hadoop Distributed File System (HDFS) solves this issue and many more. You should look into it.
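As a rough sketch of the HDFS route (the namenode host, port, and HDFS directory below are placeholders, not taken from the question):

// First copy the file into HDFS, e.g.:
//   hdfs dfs -put /home/hduser/README.md /user/hduser/
// Then read it through an hdfs:// URI so every worker fetches it from HDFS
// instead of its own local disk.
val logData = sc.textFile("hdfs://namenode-host:8020/user/hduser/README.md").cache()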
Upvotes: 5