Reputation: 3
When a task fails and causes an RDD to be recomputed from its lineage (possibly by reading the input file again), how does Spark ensure that data is not processed twice? What if the failed task had already written half of its data to an output such as HDFS or Kafka? Will it write that part of the data again? Is this related to exactly-once processing?
Upvotes: 0
Views: 468
Reputation: 1821
Output operations have at-least-once semantics by default. The function passed to foreachRDD can execute more than once if a worker fails, writing the same data to external storage multiple times. There are two approaches to solving this: idempotent updates and transactional updates. Both are discussed further in the article linked below.
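A rough sketch of both approaches in Scala, assuming a hypothetical `KeyValueStore` with an upsert-style `put` and a per-batch commit marker (none of these names come from Spark itself):

```scala
import org.apache.spark.TaskContext
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.dstream.DStream

// Hypothetical external store: an upsert-style put plus a commit marker that
// can be written atomically with the data (e.g. in the same DB transaction).
trait KeyValueStore extends Serializable {
  def put(key: String, value: String): Unit
  def isCommitted(batchMs: Long, partitionId: Int): Boolean
  def commitAtomically(records: Iterator[(String, String)],
                       batchMs: Long, partitionId: Int): Unit
}

// 1) Idempotent updates: derive a deterministic key from each record, so a
//    re-executed task overwrites the same rows instead of appending duplicates.
def writeIdempotently(stream: DStream[(String, String)], store: KeyValueStore): Unit = {
  stream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      records.foreach { case (id, value) => store.put(id, value) } // upsert by id
    }
  }
}

// 2) Transactional updates: identify each unit of work by (batch time, partition id)
//    and write the data together with that identifier in one atomic transaction;
//    a retried task sees the marker and skips the already-committed work.
def writeTransactionally(stream: DStream[(String, String)], store: KeyValueStore): Unit = {
  stream.foreachRDD { (rdd, batchTime: Time) =>
    rdd.foreachPartition { records =>
      val partitionId = TaskContext.get.partitionId
      if (!store.isCommitted(batchTime.milliseconds, partitionId)) {
        store.commitAtomically(records, batchTime.milliseconds, partitionId)
      }
    }
  }
}
```

Either way, the result is effectively exactly-once because duplicate writes are made harmless, not because Spark prevents the task from re-running.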
Further reading
http://shzhangji.com/blog/2017/07/31/how-to-achieve-exactly-once-semantics-in-spark-streaming/
Upvotes: 0