Reputation: 83
I've been using Spark/Hadoop on Dataproc for months, both via Zeppelin and the Dataproc console, but just recently I got the following error:
Caused by: java.io.FileNotFoundException: /hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1530998908050_0001/blockmgr-9d6a2308-0d52-40f5-8ef3-0abce2083a9c/21/temp_shuffle_3f65e1ca-ba48-4cb0-a2ae-7a81dcdcf466 (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
First, I got this type of error in a Zeppelin notebook and thought it was a Zeppelin issue. The error, however, seems to occur randomly. I suspect it has something to do with one of the Spark workers not being able to write to that path. So I googled, and the suggestions were to delete the files under /hadoop/yarn/nm-local-dir/usercache/ on each Spark worker and to check that there is disk space available on each worker. After doing so, I still sometimes got this error. I also ran a Spark job directly on Dataproc, and a similar error occurred there as well. I'm on Dataproc image version 1.2.
Thanks,
Peeranat F.
Upvotes: 0
Views: 2920
Reputation: 2099
There could also be other possible causes. I summarize the following actions as possible workarounds:
1.- Increase the memory of the workers and the master; this will rule out memory problems as the cause.
2.- Change the Dataproc image version.
3.- Change cluster properties to tune your cluster, especially for MapReduce and Spark (see the sketch after this list).
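For point 3, here is a minimal sketch of the kind of Spark-side tuning I have in mind, set when the session is built. The specific properties and values are only illustrative assumptions, not values I know will fix this particular error:

    import org.apache.spark.sql.SparkSession

    // Sketch only: the values below are illustrative assumptions,
    // not tested recommendations for this specific cluster.
    val spark = SparkSession.builder()
      .appName("shuffle-tuning-sketch")
      // Give each executor more headroom for shuffle writes and spills.
      .config("spark.executor.memory", "6g")
      .config("spark.yarn.executor.memoryOverhead", "1024") // MB, Spark on YARN
      // Retry shuffle I/O a few more times before failing the task.
      .config("spark.shuffle.io.maxRetries", "10")
      .config("spark.shuffle.io.retryWait", "15s")
      .getOrCreate()

On Dataproc the same properties can also be set at cluster creation time as cluster properties with the spark: prefix (for example spark:spark.executor.memory), and in Zeppelin they belong in the Spark interpreter settings rather than in the notebook, since the context is created before any paragraph runs.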
Upvotes: 0
Reputation: 2855
OK, we faced the same issue on GCP, and the reason for it is resource preemption.
In GCP, resource preemption can be done through one of two strategies.
This setting is configured in GCP by your admin/DevOps person to optimize the cost and resource utilization of the cluster, especially if it is being shared.
What your stack trace tells me is that it's node preemption. This error occurs randomly because sometimes the node that gets preempted is your driver node, which causes the app to fail altogether.
You can see which nodes are preemptible in your GCP console.
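If your cluster does use preemptible workers and you can't change that, one Spark-side workaround is to let the application tolerate more task and executor failures before it gives up, so a single preempted node doesn't immediately kill the whole job. A minimal sketch assuming Spark on YARN; the values are illustrative assumptions, not tested recommendations:

    import org.apache.spark.sql.SparkSession

    // Sketch only: illustrative values for a cluster where preemptible
    // workers (and their executors) can disappear mid-job.
    val spark = SparkSession.builder()
      .appName("preemption-tolerance-sketch")
      // Allow more attempts per task before the stage is failed outright.
      .config("spark.task.maxFailures", "8")
      // Tolerate more lost executors before YARN kills the whole application.
      .config("spark.yarn.max.executor.failures", "20")
      .getOrCreate()

These can equally be passed with --conf on spark-submit or set as cluster properties when the cluster is created, though as noted above they will not save the job if the node that gets preempted is the driver itself.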
Upvotes: 1