user1532146

Reputation: 313

Why does a Spark Accumulator's output type need to be thread-safe?

According to the AccumulatorV2 documentation, the output should be a type that can be read atomically (e.g., Int, Long) or thread-safely (e.g., synchronized collections), because it will be read from other threads.

Let's say I have a class called CheckSumAccumulator that extends AccumulatorV2 and whose output type is CheckSum. CheckSumAccumulator has a private field called checkSum, and CheckSum has a private field called count with public getter and setter methods.

import java.io.Serializable;

import org.apache.spark.util.AccumulatorV2;

public class CheckSumAccumulator extends AccumulatorV2<String, CheckSum> {
    private CheckSum checkSum;
    ...
}

public class CheckSum implements Serializable {
    private long count;

    public long getCount() {
        return count;
    }

    public void setCount(long count) {
        this.count = count;
    }
}

What could go wrong? Does the Accumulator instance run in a single thread on each executor?

Upvotes: 0

Views: 163

Answers (1)

Abdennacer Lachiheb

Reputation: 4888

A Spark accumulator is a shared variable that can be used to accumulate values across multiple tasks and stages in a Spark job. Because it is shared across multiple threads, it must be thread-safe so that updates to its value are atomic and consistent across all tasks and stages. If the accumulator were not thread-safe, race conditions could produce inconsistent results. Thread safety is typically achieved with synchronization mechanisms such as locks or atomic operations.
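As a minimal sketch of that idea applied to the question's example: one option is to hold the count in an AtomicLong (or make the getter/setter synchronized) so that the object returned by value() can be read safely from another thread. The class names SafeCheckSum / SafeCheckSumAccumulator and the "length of each record" checksum rule below are only placeholders for illustration, not the asker's actual logic.

import java.io.Serializable;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.spark.util.AccumulatorV2;

// Hypothetical: the count is held in an AtomicLong so that a reader on
// another thread (e.g. the driver polling accumulator updates for the
// web UI) never sees a torn or stale value.
class SafeCheckSum implements Serializable {
    private final AtomicLong count = new AtomicLong(0L);

    public long getCount()           { return count.get(); }
    public void setCount(long value) { count.set(value); }
    public void addCount(long delta) { count.addAndGet(delta); }
}

public class SafeCheckSumAccumulator extends AccumulatorV2<String, SafeCheckSum> {
    private final SafeCheckSum checkSum = new SafeCheckSum();

    @Override
    public boolean isZero() {
        return checkSum.getCount() == 0L;
    }

    @Override
    public AccumulatorV2<String, SafeCheckSum> copy() {
        SafeCheckSumAccumulator copy = new SafeCheckSumAccumulator();
        copy.checkSum.setCount(checkSum.getCount());
        return copy;
    }

    @Override
    public void reset() {
        checkSum.setCount(0L);
    }

    @Override
    public void add(String v) {
        // Hypothetical checksum rule: accumulate the length of each record.
        checkSum.addCount(v == null ? 0L : v.length());
    }

    @Override
    public void merge(AccumulatorV2<String, SafeCheckSum> other) {
        checkSum.addCount(other.value().getCount());
    }

    @Override
    public SafeCheckSum value() {
        return checkSum;
    }
}

AtomicLong is itself Serializable, so the object can still be shipped between executors and the driver. Alternatively, the getter and setter could be made synchronized, or value() could return an immutable snapshot such as a plain Long.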

Upvotes: 0
