Erel

Reputation: 31

Is immutability a "must" or "should" for custom accumulators?

I would like to create custom accumulators, but I can't feel confident using them since I can only test them locally for now.

My question is:

Is immutability a "must" or "should" when creating accumulators?

Although I can't find the link/reference now, I have read that only immutable objects are allowed for accumulators. However, in Spark's API (1.6), the addInPlace methods of AccumulableParam and AccumulatorParam carry the following description: "Merge two accumulated values together. Is allowed to modify and return the first value for efficiency (to avoid allocating objects)."

Which one is correct? And if mutable objects are allowed, how do I use them to create accumulators safely?

Let's say I have a mutable class with one field, and let that field be an array of integers. How should I override the addInPlace method for such a mutable class?

Should I write (Option 1):

public MyClass addInPlace(MyClass c1, MyClass c2) {
    c1.update(c2); // the int array of c1 is updated (say, by adding the two arrays) and c1 itself is returned
    return c1;
}

Or should I write (Option 2):

public MyClass addInPlace(MyClass c1, MyClass c2) {
    return update2(c1, c2); // a new MyClass object is returned, with an array created by adding the arrays of c1 and c2
}

Option 2 seems safer but requires an additional allocation. However, the quote above from the API says that modification is allowed precisely to avoid that allocation.
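To make Option 1 concrete, here is a rough sketch of how the mutate-in-place style could look if the accumulated value were simply an int[] instead of my MyClass wrapper (just an illustration of what I mean, assuming all arrays have the same length; not tested on a cluster):

import java.io.Serializable;
import org.apache.spark.AccumulatorParam;

// Sketch of the Option 1 (in-place) style with a plain int[] value.
public class IntArrayParam implements AccumulatorParam<int[]>, Serializable {

    @Override
    public int[] addAccumulator(int[] acc, int[] value) {
        return addInPlace(acc, value);
    }

    @Override
    public int[] addInPlace(int[] a1, int[] a2) {
        // Mutate and return the first argument, as the API description allows.
        for (int i = 0; i < a1.length; i++) {
            a1[i] += a2[i];
        }
        return a1;
    }

    @Override
    public int[] zero(int[] initialValue) {
        // A fresh array of zeros, sized like the initial value, never shared.
        return new int[initialValue.length];
    }
}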

In addition, if I have an array of objects (let's say MyClass2) rather than an array of integers, should I clone the objects or use the objects themselves? Let's say I want to create an accumulator for a PriorityQueue of MyClass2 (maybe I should post this as a separate question?).

I would appreciate any answers and more advanced references/documents on accumulators and Spark, especially in Java.


Edit:

Thanks to zero323 for the answer.

I wish I could find the link that confused me, but things are clearer now. However, I have two additional questions.

1) I came across the following accumulator implementation that keeps track of how many times each browser type is seen in log files. You can see the details at https://brosinski.com/post/extending-spark-accumulators/.

Here is the implementation:

public class MapAccumulator implements AccumulatorParam<Map<String, Long>>, Serializable {

    @Override
    public Map<String, Long> addAccumulator(Map<String, Long> t1, Map<String, Long> t2) {
        return mergeMap(t1, t2);
    }

    @Override
    public Map<String, Long> addInPlace(Map<String, Long> r1, Map<String, Long> r2) {
        return mergeMap(r1, r2);
    }

    @Override
    public Map<String, Long> zero(final Map<String, Long> initialValue) {
        return new HashMap<>();
    }

    private Map<String, Long> mergeMap(Map<String, Long> map1, Map<String, Long> map2) {
        Map<String, Long> result = new HashMap<>(map1);
        map2.forEach((k, v) -> result.merge(k, v, (a, b) -> a + b));
        return result;
    }
}

My question is:

Why don't we have

map2.forEach((k, v) -> map1.merge(k, v, (a, b) -> a + b));
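In other words, the in-place variant I have in mind would look roughly like this (just a sketch of the idea, not code from the linked post):

@Override
public Map<String, Long> addInPlace(Map<String, Long> r1, Map<String, Long> r2) {
    // Merge r2 into r1 and return r1, so no new HashMap is allocated.
    r2.forEach((k, v) -> r1.merge(k, v, (a, b) -> a + b));
    return r1;
}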

Also, let's say I would like to have a

Map<Integer, ArrayList<MyClass>> or ArrayList<ArrayList<MyClass>>

Can I write something like this (Option 1):

public ArrayList<ArrayList<MyClass>> addInPlace(ArrayList<ArrayList<MyClass>> a1, ArrayList<ArrayList<MyClass>> a2) {
    // For now, assume that a1 and a2 have the same size
    for (int i = 0; i < a2.size(); i++) {
        a1.get(i).addAll(a2.get(i));
    }
    return a1;
}

Or should I write (Option 2):

public ArrayList<ArrayList<MyClass>> addInPlace(ArrayList<ArrayList<MyClass>> a1, ArrayList<ArrayList<MyClass>> a2) {
    // For now, assume that a1 and a2 have the same size
    ArrayList<ArrayList<MyClass>> result = new ArrayList<ArrayList<MyClass>>();
    for (int i = 0; i < a1.size(); i++) {
        result.add(new ArrayList<MyClass>());
        result.get(i).addAll(a1.get(i));
        result.get(i).addAll(a2.get(i));
    }
    return result;
}

So is there a difference between the two options in terms of accumulator safety?

2) When you say accumulators are not thread-safe, do you mean that an RDD element can update the accumulator multiple times? Or do you mean that the objects used during the process can be changed from somewhere else in the code by another thread?

Or is it a problem only when shipping accumulators to the driver, as written in the link zero323 shared (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Accumulable.scala#L43):

"If this [[Accumulable]] is internal. Internal [[Accumulable]]s will be reported to the driver via heartbeats. For internal [[Accumulable]]s, R must be thread safe so that they can be reported correctly."

I apologize for the long entry, but I hope it will be helpful for the community as well.

Upvotes: 3

Views: 400

Answers (1)

zero323

Reputation: 330203

Is immutability required when creating custom accumulators? No, it is not. You've already discovered that both AccumulableParam.addAccumulator and AccumulableParam.addInPlace explicitly allow modification of the first argument. If you dive deeper you'll see that this scenario is actually tested in the AccumulatorSuite, where the following param is used:

new AccumulableParam[mutable.Set[A], A] {
  def addInPlace(t1: mutable.Set[A], t2: mutable.Set[A]) : mutable.Set[A] = {
    t1 ++= t2
    t1
  }
  def addAccumulator(t1: mutable.Set[A], t2: A) : mutable.Set[A] = {
    t1 += t2
    t1
  }
  def zero(t: mutable.Set[A]) : mutable.Set[A] = {
    new mutable.HashSet[A]()
  }
}
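For reference, a rough Java counterpart of that param (just a sketch against the 1.6 AccumulableParam interface, not code from the Spark test suite) could look like this:

import java.io.Serializable;
import java.util.HashSet;
import java.util.Set;
import org.apache.spark.AccumulableParam;

// Sketch only: a mutable-set param mirroring the Scala test above.
public class SetAccumulableParam<A> implements AccumulableParam<Set<A>, A>, Serializable {

    @Override
    public Set<A> addAccumulator(Set<A> acc, A element) {
        acc.add(element);   // mutate the accumulated set in place
        return acc;
    }

    @Override
    public Set<A> addInPlace(Set<A> s1, Set<A> s2) {
        s1.addAll(s2);      // merge by mutating the first argument
        return s1;
    }

    @Override
    public Set<A> zero(Set<A> initialValue) {
        return new HashSet<>();  // fresh, unshared zero value
    }
}

It could then be registered through the SparkContext's accumulable method with an initial set and an instance of this param.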

Intuitively, since each task has its own accumulator and processes its partition sequentially, there should be no case where mutability becomes an issue.

Nevertheless, as stated elsewhere, accumulables are not thread-safe. So you should probably forget about combining accumulators with parallel processing at the partition level.

Upvotes: 1
