Reputation: 2321
I'm running into an issue with a Spark job that fails on roughly every second run with the following error message:
org.apache.spark.SparkException: Job aborted due to stage failure: A shuffle map stage with indeterminate output was failed and retried. However, Spark cannot rollback the ResultStage XYZ to re-process the input data, and has to fail this job. Please eliminate the indeterminacy by checkpointing the RDD before repartition and try again.
This happens on Databricks 13.3 LTS (based on Apache Spark 3.4.1). I started out by step-wise eliminating calls to repartition(...) until none were left, but I still receive the above error. My next hypothesis was that it is caused by adaptive query execution (AQE), which may change the partitioning on the fly. But turning off AQE didn't help either.
What else could be leading to the above error if not explicit calls to repartition or AQE, and what can be done to prevent it?
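For reference, this is roughly how AQE was turned off for the test runs (a minimal sketch using the standard Spark SQL config key; the actual job code is more involved and omitted here):

```python
# Disable adaptive query execution for the session so that Spark
# does not re-plan shuffle partitioning at runtime
spark.conf.set("spark.sql.adaptive.enabled", "false")
```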
Upvotes: 4
Views: 5654
Reputation: 2321
It looks like the problem was caused by non-eager (lazy) checkpoints, i.e. calls to checkpoint(eager=False). After replacing the non-eager checkpoints with eager ones, the problem disappeared.
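A minimal sketch of the change (the checkpoint directory path and the DataFrame are placeholders; on Databricks the checkpoint directory must point at persistent storage such as DBFS):

```python
# A checkpoint directory must be set before DataFrame.checkpoint can be used
spark.sparkContext.setCheckpointDir("dbfs:/tmp/checkpoints")  # placeholder path

# Before: lazy checkpoint, only materialized when the first action runs,
# so stage retries can still recompute the indeterminate lineage
# df = df.checkpoint(eager=False)

# After: eager checkpoint, materialized immediately, giving downstream
# stages a deterministic, replayable input
df = df.checkpoint(eager=True)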
Note that according to https://github.com/apache/spark/blob/6d4d76463144c7c493cfd1f3bf5950c803d45f49/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L705:
When checkpoint is used with eager = false, the final data that is checkpointed after the first action may be different from the data that was used during the job due to non-determinism of the underlying operation and retries. If checkpoint is used to achieve saving a deterministic snapshot of the data, eager = true should be used. Otherwise, it is only deterministic after the first execution, after the checkpoint was finalized.
Upvotes: 0
Reputation: 1
I faced the same issue. It was solved after I disabled autoscaling both for the cluster and for its disk storage.
Upvotes: -1