CalmAmity

Reputation: 157

Spark worker throws FileNotFoundException on temporary shuffle files

I am running a Spark application that processes multiple sets of data points; some of these sets need to be processed sequentially. When running the application for small sets of data points (ca. 100), everything works fine. But in some cases, the sets will have a size of ca. 10,000 data points, and those cause the worker to crash with the following stack trace:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 26.0 failed 4 times, most recent failure: Lost task 0.3 in stage 26.0 (TID 36, 10.40.98.10, executor 1): java.io.FileNotFoundException: /tmp/spark-5198d746-6501-4c4d-bb1c-82479d5fd48f/executor-a1d76cc1-a3eb-4147-b73b-29742cfd652d/blockmgr-d2c5371b-1860-4d8b-89ce-0b60a79fa394/3a/temp_shuffle_94d136c9-4dc4-439e-90bc-58b18742011c (No such file or directory)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:102)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:115)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:235)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

I have checked all log files after multiple instances of this error, but did not find any other error messages.

Searching the internet for this problem, I found two potential causes, but neither seems applicable to my situation.

I have been flailing at this problem for a couple of hours, trying to find workarounds and possible causes.

What is causing this problem? How can I go about determining the cause myself?

Upvotes: 5

Views: 5194

Answers (1)

CalmAmity

Reputation: 157

The problem turns out to be a stack overflow (ha!) occurring on the worker.

On a hunch, I rewrote the operation to run entirely on the driver (effectively disabling Spark's distributed execution). When I ran this version, the system still crashed, but now it surfaced a StackOverflowError directly. Contrary to what I previously believed, a tail-recursive method can overflow the stack just like any other form of recursion: the JVM does not perform tail-call elimination, so each call still consumes a stack frame. After rewriting the method to use a loop instead of recursion, the problem disappeared.
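To illustrate the point about tail recursion (this is a minimal standalone sketch, not the actual code from my application): the recursive method below is tail-recursive in form, yet plain Java still allocates a frame per call, so it blows the stack at large depths where the iterative rewrite runs in constant stack space.

```java
public class TailRecursionDemo {
    // Tail-recursive in form, but the JVM performs no tail-call
    // elimination, so every call still consumes a stack frame.
    static long sumRecursive(long n, long acc) {
        if (n == 0) return acc;
        return sumRecursive(n - 1, acc + n);
    }

    // Iterative rewrite of the same computation: constant stack usage.
    static long sumIterative(long n) {
        long acc = 0;
        for (long i = 1; i <= n; i++) acc += i;
        return acc;
    }

    public static void main(String[] args) {
        // Works for any n that fits in a long.
        System.out.println(sumIterative(1_000_000));
        try {
            // With a default thread stack, a depth of one million
            // frames typically overflows.
            System.out.println(sumRecursive(1_000_000, 0));
        } catch (StackOverflowError e) {
            System.out.println("StackOverflowError at large recursion depth");
        }
    }
}
```

(In Scala, annotating the method with `@tailrec` would have made the compiler either optimize it into a loop or reject it at compile time; my method evidently did not qualify for that optimization.)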

A stack overflow is probably not the only problem that can produce the original FileNotFoundException, but temporarily pulling the operation onto the driver seems to be a good way to surface the actual cause of the problem.
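The debugging workflow I mean looks roughly like this (a hedged sketch with hypothetical names; `process` stands in for whatever per-element function you would normally pass to `rdd.map`): instead of running the function inside executors, where a fatal error can manifest only as a lost shuffle file, collect the data and apply the same function in a plain loop on the driver so the real exception reaches you intact.

```java
import java.util.ArrayList;
import java.util.List;

public class DriverDebugSketch {
    // Hypothetical per-element operation; stands in for the function
    // normally shipped to the executors via rdd.map(...).
    static int process(int depth) {
        if (depth == 0) return 0;
        return 1 + process(depth - 1);  // deep recursion can overflow here
    }

    public static void main(String[] args) {
        // In Spark this would be: rdd.map(DriverDebugSketch::process).collect()
        // For debugging, apply the same function locally on the driver, so a
        // StackOverflowError (or any other fatal error) surfaces directly
        // instead of as a cryptic FileNotFoundException on a shuffle file.
        List<Integer> data = List.of(10, 100, 1000);
        List<Integer> results = new ArrayList<>();
        for (int d : data) {
            results.add(process(d));
        }
        System.out.println(results);
    }
}
```

Once the local run either succeeds or shows the real exception, you can revert to the distributed version.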

Upvotes: 3
