Reputation: 483
When collecting the elements of a stream into a set, is there any advantage (or drawback) to also specifying .distinct()
on the stream? For example:
return items.stream().map(...).distinct().collect(toSet());
Given that the set will already remove duplicates, this seems redundant, but does it offer any performance advantage or disadvantage? Does the answer depend on whether the stream is parallel/sequential or ordered/unordered?
Upvotes: 18
Views: 13206
Reputation: 11959
While you have the same result, they don't do the same thing: toSet()
use HashSet
, and you lose the initial ordering which is what distinct can preserve if required:
From the javadoc:
Preserving stability for distinct() in parallel pipelines is relatively expensive (requires that the operation act as a full barrier, with substantial buffering overhead), and stability is often not needed. Using an unordered stream source (such as generate(Supplier)) or removing the ordering constraint with BaseStream.unordered() may result in significantly more efficient execution for distinct() in parallel pipelines, if the semantics of your situation permit. If consistency with encounter order is required, and you are experiencing poor performance or memory utilization with distinct() in parallel pipelines, switching to sequential execution with BaseStream.sequential() may improve performance.
If you require stability, then it is distinct()
. Using toSet()
after would be useless (if not required by an API).
That is however useful if you have an equals
implementing a partial equality:
class F {
int a;
int b;
@Override int hashCode() {return Objects.hashCode(a);}
@Override boolean equals(Object other) {
if (other == this) return true;
if (!(other instanceof F)) return false;
return a == ((F)other).a;
}
}
If you have a = F(10, 1)
and b = F(10, 2)
they are equals. But not all their fields are equals.
If in the list you have (b, a)
toSet()
you won't always have this order. You might have (b, a), etc.(b, a)
.This however assume some prerequisites (sequential, etc).
Note: this could be done using a TreeSet
and appropriate compareTo
method.
Upvotes: 2
Reputation: 120858
distinct will call equals/hashcode to separate items and later toSet will do the same thing (even if after distinct there is no need, but toSet can't really know that), so basically you are just duplicating the calls. It should performance slightly worse IMO. It is pretty easy to measure too.
Upvotes: 0
Reputation: 8202
According to the javadoc, distinct
is a stateful intermediate operation.
If you literally have .distinct
followed immediately by .collect
, it doesn't really add any benefit. Maybe if the .distinct
implementation is more performant than the Set
duplication check, you might get some benefit, but if you're collecting to a set you're going to end up with the same result anyway.
If, on the other hand, .distinct
occurs before your .map
operation, and that particular mapping is an expensive operation, you may get some gains there because you're processing less data overall.
Upvotes: 16