Reputation: 4877
Given an `initial: PairRDD[(Long, Long)]`, what is the most efficient method to get an `other: PairRDD[(Long, Long)]` which contains each pair of `initial` exactly once (i.e. filters out duplicate pairs)?
Specifically, is there something more efficient than `initial.distinct()`?
Upvotes: 0
Views: 896
Reputation: 330443
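In the general case, when you make no assumptions about the data distribution and you require exact results, `distinct` implements pretty much a minimal correct solution, which:
- removes duplicates within each partition of the input,
- shuffles the data,
- removes duplicates within each partition of the shuffled output.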
So unless you want to modify the internals, there is not much you can improve here.
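To make the point about `distinct` concrete, here is a minimal sketch of a roughly equivalent pipeline written with public RDD operations. The object and method names (`DistinctSketch`, `manualDistinct`) and the sample data are hypothetical, and the sketch assumes Spark is on the classpath and runs in local mode; it illustrates the shape of the computation rather than Spark's exact source.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DistinctSketch {
  // Roughly what distinct() amounts to: use each pair as its own key,
  // let reduceByKey collapse duplicates before and after the shuffle,
  // then drop the dummy value again.
  def manualDistinct(initial: RDD[(Long, Long)]): RDD[(Long, Long)] =
    initial
      .map(pair => (pair, null))          // the pair itself is the key
      .reduceByKey((first, _) => first)   // keeps one record per key
      .map(_._1)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("distinct-sketch")
    val sc = new SparkContext(conf)
    try {
      val initial = sc.parallelize(Seq((1L, 2L), (1L, 2L), (3L, 4L), (1L, 5L)))
      val other = manualDistinct(initial)
      println(other.collect().sorted.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Because `reduceByKey` combines on the map side by default, duplicates are dropped once per input partition and once more after the shuffle, so a hand-written version like this is unlikely to beat `initial.distinct()`.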
That being said, if you can make some assumptions and/or relax the requirements, you can improve on that.
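- `combineByKey` with `mapSideCombine` set to `false`, which skips the map-side combine step.
- `.repartitionAndSortWithinPartitions` and `ExternalSorter`, which give a sort-based alternative.
- If the RDD is already partitioned (for example, as the result of a `.byKey` operation), you can perform only a local distinct-like operation, with the exact choice depending on the amount of data.
Upvotes: 2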
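The alternatives mentioned in the answer can be sketched as follows. This is an illustration under assumptions, not a tuned implementation: the names `DistinctVariants`, `withoutMapSideCombine`, `sortBased`, `localDistinct` and `dropAdjacentDuplicates` are hypothetical, the `HashPartitioner` and partition counts are arbitrary choices, and `ExternalSorter` is not called directly because it is internal to Spark. The last variant assumes the RDD is partitioned by the first element of each pair, so equal pairs are already in the same partition.

```scala
import scala.collection.mutable
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DistinctVariants {
  type Pair = (Long, Long)

  // Variant 1: shuffle every record and deduplicate only on the reduce side.
  // The Boolean is a dummy value; only the key (the pair itself) matters.
  def withoutMapSideCombine(initial: RDD[Pair], numPartitions: Int): RDD[Pair] =
    initial
      .map(pair => (pair, true))
      .combineByKey[Boolean](
        (_: Boolean) => true,
        (_: Boolean, _: Boolean) => true,
        (_: Boolean, _: Boolean) => true,
        new HashPartitioner(numPartitions),
        mapSideCombine = false)
      .keys

  // Helper: in a sorted iterator, equal elements are adjacent,
  // so one linear scan with no hash map removes them.
  def dropAdjacentDuplicates[T](sorted: Iterator[T]): Iterator[T] = new Iterator[T] {
    private val buffered = sorted.buffered
    def hasNext: Boolean = buffered.hasNext
    def next(): T = {
      val current = buffered.next()
      while (buffered.hasNext && buffered.head == current) buffered.next()
      current
    }
  }

  // Variant 2: repartition and sort by the pair, then scan each partition once.
  def sortBased(initial: RDD[Pair], numPartitions: Int): RDD[Pair] =
    initial
      .map(pair => (pair, true))
      .repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))
      .mapPartitions(iter => dropAdjacentDuplicates(iter.map(_._1)))

  // Variant 3: the RDD already has a partitioner on the first element,
  // so duplicates never span partitions and no shuffle is needed.
  // The hash set holds every distinct pair of one partition in memory.
  def localDistinct(partitioned: RDD[Pair]): RDD[Pair] = {
    require(partitioned.partitioner.isDefined, "expects an RDD with a partitioner set")
    partitioned.mapPartitions({ iter =>
      val seen = mutable.HashSet.empty[Pair]
      iter.filter(pair => seen.add(pair))
    }, preservesPartitioning = true)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("distinct-variants")
    val sc = new SparkContext(conf)
    try {
      val initial = sc.parallelize(Seq((1L, 2L), (1L, 2L), (3L, 4L), (1L, 5L), (3L, 4L)))
      println(withoutMapSideCombine(initial, 2).collect().sorted.mkString(", "))
      println(sortBased(initial, 2).collect().sorted.mkString(", "))
      println(localDistinct(initial.partitionBy(new HashPartitioner(2))).collect().sorted.mkString(", "))
    } finally {
      sc.stop()
    }
  }
}
```

Which variant helps, if any, depends on the data: skipping the map-side combine mainly pays off when duplicates are rare, the sort-based route trades hash maps for sorting, and the local variant only applies when the existing partitioning already co-locates equal pairs.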