Reputation: 157
I am running a Spark application that processes multiple sets of data points; some of these sets need to be processed sequentially. When running the application for small sets of data points (ca. 100), everything works fine. But in some cases, the sets will have a size of ca. 10,000 data points, and those cause the worker to crash with the following stack trace:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 4 times, most recent failure: Lost task 0.3 in stage 26.0 (TID 36, 10.40.98.10, executor 1): java.io.FileNotFoundException: /tmp/spark-5198d746-6501-4c4d-bb1c-82479d5fd48f/executor-a1d76cc1-a3eb-4147-b73b-29742cfd652d/blockmgr-d2c5371b-1860-4d8b-89ce-0b60a79fa394/3a/temp_shuffle_94d136c9-4dc4-439e-90bc-58b18742011c (No such file or directory)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:102)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:115)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:235)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
I have checked all log files after multiple instances of this error, but did not find any other error messages.
Searching the internet for this problem, I have found two potential causes that do not seem to be applicable to my situation:

- A problem with the /tmp/ directory itself (this is where Spark writes its temporary shuffle files).
- The /tmp/ directory does not have enough space for shuffle files (or other temporary Spark files).

The /tmp/ directory on my system has about 45GB available, and the amount of data in a single data point (< 1KB) means that this is also probably not the case.

I have been flailing at this problem for a couple of hours, trying to find work-arounds and possible causes.
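For reference, the directory Spark uses for these temporary shuffle files is controlled by the spark.local.dir property (it defaults to /tmp). A minimal sketch, assuming a hypothetical scratch directory /data/spark-scratch with more free space, would be to point Spark there:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Minimal sketch, not from the original application: /data/spark-scratch is a
// hypothetical directory with more free space than /tmp. spark.local.dir
// controls where executors write shuffle and spill files (cluster managers
// such as YARN override it with their own local directories).
val conf = new SparkConf()
  .setAppName("shuffle-scratch-dir-example")
  .set("spark.local.dir", "/data/spark-scratch")

val spark = SparkSession.builder().config(conf).getOrCreate()
```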
What is causing this problem? How can I go about determining the cause myself?
Upvotes: 5
Views: 5194
Reputation: 157
The problem turns out to be a stack overflow (ha!) occurring on the worker.
On a hunch, I rewrote the operation to be performed entirely on the driver (effectively disabling the Spark functionality). When I ran this code, the system still crashed, but now displayed a StackOverflowError instead. Contrary to what I previously believed, tail-recursive methods can apparently cause a stack overflow just like any other form of recursion. After rewriting the method to no longer use recursion, the problem disappeared.
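To make the failure mode concrete, here is a hedged sketch with made-up method names (not the original code). The Scala compiler only turns recursion into a loop for direct self-recursion it can prove safe, so a call that sits in tail position but bounces between two methods still consumes one stack frame per data point, and on the order of 10,000 frames can be enough to exhaust a default-sized JVM thread stack:

```scala
object RecursionSketch {
  // Hypothetical sketch: both calls below are in tail position, but because
  // the two methods call each other the compiler cannot rewrite them into a
  // loop, so each data point adds one stack frame.
  def processEven(points: List[Double], acc: Double): Double = points match {
    case Nil          => acc
    case head :: tail => processOdd(tail, acc + head)   // tail call, not eliminated
  }

  def processOdd(points: List[Double], acc: Double): Double = points match {
    case Nil          => acc
    case head :: tail => processEven(tail, acc - head)  // ~10,000 frames can overflow the stack
  }

  // The non-recursive rewrite walks the list with an explicit loop and uses
  // constant stack space regardless of how many data points there are.
  def processIteratively(points: List[Double]): Double = {
    var acc  = 0.0
    var sign = 1.0
    for (p <- points) {
      acc += sign * p
      sign = -sign
    }
    acc
  }
}
```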
A stack overflow is probably not the only problem that can produce the original FileNotFoundException, but making a temporary code change that pulls the operation onto the driver seems to be a good way to determine the actual cause of the problem.
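A rough sketch of that debugging step, with placeholder names (loadSets and processSet stand in for whatever the real job does): run the same per-set logic on plain driver-side collections instead of through an RDD, so the real exception is reported directly rather than appearing as a lost shuffle file on an executor.

```scala
import org.apache.spark.sql.SparkSession

object DriverSideDebug {
  // Placeholder for the real per-set computation.
  def processSet(points: Seq[Double]): Double = points.sum

  // Placeholder loader producing a few sets of ~10,000 data points.
  def loadSets(): Seq[Seq[Double]] =
    Seq.fill(5)(Seq.tabulate(10000)(_.toDouble))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("driver-side-debug").getOrCreate()
    val sc    = spark.sparkContext
    val sets  = loadSets()

    // Normal, distributed path: a failure inside processSet shows up as a
    // failed task (and possibly a misleading missing-shuffle-file error).
    // val results = sc.parallelize(sets).map(processSet).collect()

    // Temporary debug path: identical logic, executed entirely on the driver,
    // so a StackOverflowError (or whatever the real problem is) is reported
    // directly with its own stack trace.
    val results = sets.map(processSet)

    results.foreach(println)
    spark.stop()
  }
}
```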
Upvotes: 3