user3137110

Reputation: 339

Getting unique values of pairs in an RDD when the order within the pair is irrelevant

I have an RDD with the following values:

a,b
a,c
a,d
b,a
c,d
d,c
d,e

What I need is an RDD that contains the reciprocal pairs, but only one pair from each set. It would have to be:

a,b  or b,a
c,d  or d,c

I was thinking the pairs could be added to a list and looped through to find the opposite pair: if one exists, keep the first pair and delete the reciprocal one. I suspect there is a way of doing this with Scala functions like join or case, but I am having difficulty understanding them.

Upvotes: 2

Views: 312

Answers (1)

marios

Reputation: 8996

If you don't mind the order within each pair changing (e.g., (a,b) becoming (b,a)), there is a simple solution that is easy to parallelize: put every pair into one canonical order, then drop the duplicates. The examples below use numbers, but the pairs can hold anything, as long as the values are comparable.

In vanilla Scala:

List(
  (2, 1),
  (3, 2),
  (1, 2),
  (2, 4),
  (4, 2)
).map { case (a, b) => if (a > b) (a, b) else (b, a) }.toSet

This will result in:

res1: Set[(Int, Int)] = Set((2, 1), (3, 2), (4, 2))
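Note that the result above still contains (3, 2), a pair whose reverse never appears in the input. If the goal is to keep only pairs that do have a reciprocal, as the question seems to ask, one possible variation is sketched below in plain Scala. The object name `ReciprocalPairs` and the helper `reciprocalPairs` are hypothetical names chosen for this illustration, and keeping the orientation with the smaller element first is an arbitrary choice.

```scala
object ReciprocalPairs {
  // Hypothetical helper: keeps one representative (smaller element first)
  // of every pair whose reversed counterpart also occurs in the input.
  def reciprocalPairs[A: Ordering](pairs: Seq[(A, A)]): Set[(A, A)] = {
    val ord  = implicitly[Ordering[A]]
    val seen = pairs.toSet // removes exact duplicates and allows fast lookup
    seen.filter { case (x, y) => ord.lt(x, y) && seen.contains((y, x)) }
  }

  def main(args: Array[String]): Unit = {
    // The data from the question.
    val input = Seq(
      ("a", "b"), ("a", "c"), ("a", "d"), ("b", "a"),
      ("c", "d"), ("d", "c"), ("d", "e")
    )
    // The resulting set holds (a,b) and (c,d); iteration order is not guaranteed.
    println(reciprocalPairs(input))
  }
}
```

Because the filter requires the first element to be strictly smaller, pairs of the form (x, x) are dropped in this sketch.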

With a Spark RDD, the same approach can be expressed as:

sc.parallelize((2, 1) :: (3, 2) :: (1, 2) :: (2, 4) :: (4, 2) :: Nil)
  .map { case (a, b) => if (a > b) (a, b) else (b, a) }
  .distinct()
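Like the plain Scala version, this keeps every distinct pair, including pairs that have no reciprocal. A minimal sketch of an RDD variant that keeps only reciprocal pairs is shown below. It is not part of the original approach: it assumes a local Spark installation, the names `ReciprocalPairsRdd` and `reciprocalPairs` are hypothetical, and it relies on `RDD.intersection`, which returns the elements common to both RDDs without duplicates.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ReciprocalPairsRdd {
  // Hypothetical helper: a pair survives the intersection only if its
  // swapped form is also present; the filter then keeps one orientation.
  def reciprocalPairs(pairs: RDD[(String, String)]): RDD[(String, String)] =
    pairs
      .intersection(pairs.map(_.swap))
      .filter { case (x, y) => x < y }

  def main(args: Array[String]): Unit = {
    // Assumption: run locally; adjust the master URL for a real cluster.
    val conf = new SparkConf().setAppName("reciprocal-pairs").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // The data from the question.
    val pairs = sc.parallelize(Seq(
      ("a", "b"), ("a", "c"), ("a", "d"), ("b", "a"),
      ("c", "d"), ("d", "c"), ("d", "e")
    ))

    // Collects (a,b) and (c,d); the order of the collected rows is not guaranteed.
    reciprocalPairs(pairs).collect().foreach(println)

    sc.stop()
  }
}
```

The intersection involves a shuffle, so on large inputs it is likely to cost more than the map-and-distinct version above.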

Upvotes: 3
