shmosel
shmosel

Reputation: 50732

Stream reduction identity vs. idempotent value

The java.util.stream package documentation gives this definition for identity in the context of reduction:

the identity value must be an identity for the combiner function. This means that for all u, combiner.apply(identity, u) is equal to u.

Stream.reduce() and its primitive counterparts offer a similar definition.

As far as I understand, this definition is necessary to support parallel streams. For example, a non-zero seed value for a sum reduction could be multiplied by the number of parallel processors, distorting the final result unpredictably.

But this definition seems stricter than necessary to support parallelism. Why not require an idempotent value x, such that combiner.apply(x, x) is equal to x? This would protect functions like Integer::sum, (a, b) -> a * b and String::concat from skewing across multiple processes, while still allowing the use of any seed with idempotent functions like Math::max, (a, b) -> a | b and Sets::intersection.

Is there some unique benefit to an identity value that I'm overlooking?

Upvotes: 4

Views: 239

Answers (2)

Misha
Misha

Reputation: 28163

It's not enough to ask "why not?" and offer some examples where it won't cause a problem. You have to offer a proof that with this relaxed requirement, sequential and parallel evaluations are guaranteed to produce the same correct result.

Let's use the 2-argument reduce to simplify matters. Let's also limit ourselves to a stream with 2 elements (a, b). let identity be i and accumulator be op.

When evaluated sequentially, the result is (i op a) op b. When evaluated in parallel, it might be (i op a) op (i op b) (or there might even be a naked i in between, but let's not worry about that for now). And we want both of those to equal a op b

If i is required to be an identity, it is easy to see that both sequential and parallel evaluations will be correct. But if all we require of i is that i op i = i, then it does not obviously follow that (i op a) op (i op b) must equals a op b for any associative op and any a and b.

One counter-example I can think of is space-collapsing concatenation: (x, y) -> (x + y).replaceAll(" +", " "). It is associative and " " is idempotent but not an identity with respect to this function. Parallel and sequential evaluations will yield different results.

Upvotes: 4

Andreas
Andreas

Reputation: 159155

Why would combiner.apply(x, x) == x be a good definition of identity?

Say the combiner is max(). Sure, max(x, x) == x is true. It is actually true for any x, which means that any value would be a good identity value for max().

That is certainly not true, since the only valid identity value for max() combiner is MIN_VALUE (or a "no value" such as null or NaN, assuming combiner understands and ignores such a value).

Upvotes: 0

Related Questions