Reputation: 313
According to AccumulatorV2, the output of it should be a type that can be read atomically (e.g., Int, Long), or thread-safely (e.g., synchronized collections) because it will be read from other threads.
Let's say I have a class called CheckSumAccumulator which extends from AccumulatorV2, CheckSumAccumulator's output type is CheckSum, CheckSumAccumulator has a private field called checkSum; CheckSum has a private field called count and it has public setting and getter methods.
public class CheckSumAccumulator extends AccumulatorV2<String, CheckSum> {
private CheckSum checkSum;
...
}
public class CheckSum extends Serializable {
private long count;
public long getCount() {
return count;
}
public void setCount(long count) {
this.count = count;
}
}
What could go wrong? Does Accumulator instance runs in single thread in each Executor?
Upvotes: 0
Views: 163
Reputation: 4888
Spark Accumulator is a shared variable that can be used to accumulate values across multiple tasks and stages in a Spark job. Because it is shared across multiple threads, it is important for it to be thread-safe to ensure that updates to the accumulator's value are atomic and consistent across all tasks and stages. If the accumulator were not thread-safe, it could lead to race conditions and inconsistent results. Thread-safety is typically achieved by using synchronization mechanisms such as locks or atomic operations.
Upvotes: 0