mitchus

Reputation: 4877

What is the most efficient way in Spark to get unique pairs in a PairRDD?

Given an initial: PairRDD[(Long, Long)], what is the most efficient way to obtain an other: PairRDD[(Long, Long)] that contains each pair of initial exactly once (i.e. with duplicate pairs filtered out)?

Specifically, is there something more efficient than initial.distinct()?

Upvotes: 0

Views: 896

Answers (1)

zero323

Reputation: 330443

In the general case, when you make no assumptions about the data distribution and you require exact results, distinct implements pretty much a minimal correct solution (see the sketch after the list), which:

  • drops duplicates for each initial partition
  • shuffles the data
  • drops duplicates for each shuffle output
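
For reference, distinct itself is roughly equivalent to the following (a hedged sketch; initial is the RDD from the question, and the whole (Long, Long) pair is treated as the key):

    // Each pair becomes a key with a dummy value; reduceByKey then drops
    // duplicates once per input partition (map-side combine) and once more
    // per shuffle output, which matches the three steps listed above.
    val deduplicated = initial
      .map(pair => (pair, null))
      .reduceByKey((left, _) => left)
      .map(_._1)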

So unless you want to modify Spark internals, there is not much you can improve here.

That being said, if you can make some assumptions and/or relax the requirements, you can improve on that.

  • If you expect duplicates to be rare, you can avoid map-side combine to reduce memory usage and GC pressure in the initial phase. You can do this, for example, with combineByKey with mapSideCombine set to false (see the first sketch after this list).
  • If you can accept some loss of data, you can use a set of Bloom filters, at the cost of an additional job.
  • If the data is orderable, you could use an external sort followed by a linear scan to avoid keeping a hash map in memory. For starters you can take a look at repartitionAndSortWithinPartitions and ExternalSorter (see the second sketch below).
  • If the RDD already has a partitioner set (for example it is the result of some kind of byKey operation), you can perform only a local distinct-like operation, with the exact choice depending on the amount of data (see the third sketch below).
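
First sketch: a distinct-like job that skips map-side combine, assuming duplicates are rare enough that pre-aggregating before the shuffle would mostly waste memory and cause GC pressure. The name deduped is just for illustration; initial is the RDD from the question:

    import org.apache.spark.HashPartitioner

    val deduped = initial
      .map(pair => (pair, 1))                 // whole pair as key, dummy value
      .combineByKey(
        (v: Int) => v,                        // createCombiner
        (c: Int, _: Int) => c,                // mergeValue
        (c1: Int, _: Int) => c1,              // mergeCombiners
        new HashPartitioner(initial.partitions.length),
        mapSideCombine = false                // no per-partition hash map before the shuffle
      )
      .keys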
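Second sketch: a sort-based variant. It shuffles the pairs with repartitionAndSortWithinPartitions (which sorts through ExternalSorter, spilling to disk instead of holding a hash map in memory) and then removes duplicates with a linear scan, since equal pairs end up adjacent:

    import org.apache.spark.HashPartitioner

    val sortedDistinct = initial
      .map(pair => (pair, null))                        // whole pair as the key
      .repartitionAndSortWithinPartitions(
        new HashPartitioner(initial.partitions.length))
      .mapPartitions { iter =>
        // Linear scan over sorted data: keep only the first pair of each run.
        var previous: Option[(Long, Long)] = None
        iter.flatMap { case (pair, _) =>
          if (previous.contains(pair)) Iterator.empty
          else {
            previous = Some(pair)
            Iterator.single(pair)
          }
        }
      }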
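Third sketch: when initial already has a partitioner on the key (the first Long), all copies of a given pair live in the same partition, so a purely local, shuffle-free dedup is enough. For moderate amounts of data per partition a hash set works; for larger partitions an in-partition sort would be the alternative:

    import scala.collection.mutable

    val locallyDistinct = initial.mapPartitions({ iter =>
      val seen = mutable.HashSet.empty[(Long, Long)]
      iter.filter(seen.add)   // HashSet.add returns true only on first occurrence
    }, preservesPartitioning = true)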

Upvotes: 2
