Adelin
Adelin

Reputation: 18991

What does "commutative and associative" mean in terms of Apache Beam and parallel processing, in general?

From the documentation

When you apply a Combine transform, you must provide the function that contains the logic for combining the elements or values. The combining function should be commutative and associative ....

Upvotes: 0

Views: 584

Answers (2)

Apurva Singh
Apurva Singh

Reputation: 5000

In Apache Spark, it distributes data to nodes, processes on multiple machines and then obtains results from distributed datasets. The distributed dataset is called RDD. Ok lets see these two properties and their importance in parallel systems.
val wordsRdd linesRdd.map(s => s.split(","))
This is simply splitting lines into words. map is a local operation, i.e. it will execute the split function on the nodes, and not coordination etc is needed between nodes.
val wordsRddWith1 = wordsRdd.map(w => (w, 1))
This is also local. This will create for example, 5 tuples like
('cat', 1), ('bat', 1), ('cat', 1), ('rat', 1), ('rat', 1)
Now we looks at reduceByKey. The function it is using to operate on tuples must be associative and commutative.
val wordsCount = wordsRddWith1.reduce((tpl1, tpl2) => tpl1._2 + tpl2._2)
This will produce:
cat, 2 | bat, 1 | rat, 2
Associative means, function must correctly execute, whether irrespective of how the 5 tuples above get distributed to nodes. Tuples can end up for processing in any node. Any tuple can associate with any other tuple on any node for the summing up of count locally in the node.
Commutative means, function must correctly execute, irrespective of the order in which function is applied. So a group of tuples may be processed by function before or after another group of tuples. It shouldn't matter.
So associative is about distributing data to nodes, and commutative is about processing data.

Upvotes: 0

Alain Merigot
Alain Merigot

Reputation: 11557

The meaning of associative and commutative is exactly the same as in mathematics.

An operator "+" is said to be commutative iff a+b=b+a
An operator "+" is said to be associative iff (a+b)+c=a+(b+c)

For the "combine transform" described in the documentation, you try to implement an accumulation.

s=a+b+c+d

where "+" is any operator.

Associativity is an absolute requirement to be able to parallelize such an operation. If "+" is not associative

  1. a+b+c+d has no meaning, as ((a+b)+c)+d != (a+(b+c))+d. To give an unparenthesised expression a signification, result must not depend on the grouping of operations.

  2. You cannot modify parenthesis to rearrange operations order to perform them in parallel
    (((a+b) + c) + d) is inherently sequential: compute a+b, then add c, then add d
    ((a+b) + (c+d)) allows to compute (a+b) and (c+d) in parallel.

Commutativity is less frequently required as a constraint for parallelization, but it allows to permute the order of operands.

Upvotes: 4

Related Questions