Reputation: 339
I have an RDD with the following values:
a,b
a,c
a,d
b,a
c,d
d,c
d,e
What I need is an RDD that contains only the reciprocal pairs, with just one pair kept from each set. The result would have to be:
a,b or b,a
c,d or d,c
I was thinking the pairs could be added to a list and looped through to find the opposite pair: if one exists, keep the first pair and delete its reciprocal. I suspect there is a way to do this with Scala functions like join or case, but I am having difficulty understanding them.
Upvotes: 2
Views: 312
Reputation: 8996
If you don't mind the order within a pair changing (e.g., (a,b) becoming (b,a)), there is a simple solution that is easy to parallelize: put each pair into a canonical order, then remove the duplicates. The examples below use numbers, but the pairs can hold anything as long as the values can be compared with each other.
In vanilla Scala:
List(
(2, 1),
(3, 2),
(1, 2),
(2, 4),
(4, 2)).map{ case(a,b) => if (a>b) (a,b) else (b,a) }.toSet
This will result in:
res1: Set[(Int, Int)] = Set((2, 1), (3, 2), (4, 2))
With a Spark RDD, the same thing can be expressed as:
sc.parallelize((2, 1)::(3, 2)::(1, 2)::(2, 4)::(4, 2)::Nil)
  .map { case (a, b) => if (a > b) (a, b) else (b, a) } // put the larger value first
  .distinct() // drop the duplicates created by the reordering
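Note that reordering and deduplicating keeps every pair, including ones such as (3, 2) whose reverse never appears in the input. If only pairs whose reverse is also present should survive, as in the question, one possible approach is to check for the reverse explicitly. The following is a minimal sketch on plain Scala collections; the object name `ReciprocalPairs` and the helper `keepReciprocal` are made up for illustration, and it assumes an `Ordering` exists for the values and that the data fits in memory.

```scala
object ReciprocalPairs {
  // Hypothetical helper: returns one representative (smaller value first)
  // of every pair whose reverse also occurs in the input.
  def keepReciprocal[A](pairs: Seq[(A, A)])(implicit ord: Ordering[A]): Set[(A, A)] = {
    val present = pairs.toSet // fast lookup of which pairs exist
    pairs.collect {
      // keep (a, b) only when a < b and the reverse (b, a) is present too;
      // the strict comparison also drops self-pairs such as (a, a)
      case (a, b) if ord.lt(a, b) && present.contains((b, a)) => (a, b)
    }.toSet
  }

  def main(args: Array[String]): Unit = {
    // the data from the question
    val input = Seq(("a", "b"), ("a", "c"), ("a", "d"), ("b", "a"),
                    ("c", "d"), ("d", "c"), ("d", "e"))
    println(keepReciprocal(input)) // Set((a,b), (c,d))
  }
}
```

On an RDD, the same idea could probably be written as rdd.intersection(rdd.map(_.swap)).filter { case (a, b) => a < b }, since intersection returns the distinct elements common to both RDDs. That variant is not tested here.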
Upvotes: 3