Reputation: 50732
The java.util.stream
package documentation gives this definition for identity in the context of reduction:
the
identity
value must be an identity for the combiner function. This means that for allu
,combiner.apply(identity, u)
is equal tou
.
Stream.reduce()
and its primitive counterparts offer a similar definition.
As far as I understand, this definition is necessary to support parallel streams. For example, a non-zero seed value for a sum reduction could be multiplied by the number of parallel processors, distorting the final result unpredictably.
But this definition seems stricter than necessary to support parallelism. Why not require an idempotent value x
, such that combiner.apply(x, x)
is equal to x
? This would protect functions like Integer::sum
, (a, b) -> a * b
and String::concat
from skewing across multiple processes, while still allowing the use of any seed with idempotent functions like Math::max
, (a, b) -> a | b
and Sets::intersection
.
Is there some unique benefit to an identity value that I'm overlooking?
Upvotes: 4
Views: 239
Reputation: 28163
It's not enough to ask "why not?" and offer some examples where it won't cause a problem. You have to offer a proof that with this relaxed requirement, sequential and parallel evaluations are guaranteed to produce the same correct result.
Let's use the 2-argument reduce to simplify matters. Let's also limit ourselves to a stream with 2 elements (a, b). let identity be i and accumulator be op.
When evaluated sequentially, the result is (i op a) op b. When evaluated in parallel, it might be (i op a) op (i op b) (or there might even be a naked i in between, but let's not worry about that for now). And we want both of those to equal a op b
If i is required to be an identity, it is easy to see that both sequential and parallel evaluations will be correct. But if all we require of i is that i op i = i, then it does not obviously follow that (i op a) op (i op b) must equals a op b for any associative op and any a and b.
One counter-example I can think of is space-collapsing concatenation: (x, y) -> (x + y).replaceAll(" +", " ")
. It is associative and " "
is idempotent but not an identity with respect to this function. Parallel and sequential evaluations will yield different results.
Upvotes: 4
Reputation: 159155
Why would combiner.apply(x, x) == x
be a good definition of identity?
Say the combiner is max()
. Sure, max(x, x) == x
is true. It is actually true for any x
, which means that any value would be a good identity
value for max()
.
That is certainly not true, since the only valid identity value for max()
combiner is MIN_VALUE
(or a "no value" such as null
or NaN
, assuming combiner understands and ignores such a value).
Upvotes: 0