Reputation: 1349
I have an RDD of the form RDD[((Int, Int), Vector[(Long, Long)])]. I need to operate on the Vector part and remove any duplicate mappings from it.
For example, if the Vector contains two pairs with the same first element, such as (3,2) and (3,4), then only one of them should be kept (either one will do, say (3,2)).
A full-fledged example would be:
((1,1),Vector((3,3), (1,1), (1,2), (2,1), (2,2), (4,4)))
((5,2),Vector((4,3), (4,2), (2,1)))
((5,2),Vector((4,3), (1,2)))
which should become
((1,1),Vector((3,3), (1,1), (2,1), (4,4)))
((5,2),Vector((4,3), (2,1)))
((5,2),Vector((4,3), (1,2)))
I started with something like this but got stuck:
curRDD.map{ case (left, right) =>
for((ll,li) <- right) yield {
}
}
How do I achieve this?
Upvotes: 3
Views: 99
Reputation: 41749
curRDD.map { case (left, right) => (left, right.toMap.toVector) }
.toMap conveniently keeps only one entry for each key (the last one it encounters, since later pairs overwrite earlier ones).
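To see what that does to a single Vector without needing a Spark context, here is a minimal sketch on plain Scala collections. The names ToMapDemo and dedupeViaMap are made up for illustration and are not part of the answer above.

```scala
object ToMapDemo {
  // Collapse pairs that share the same first element by
  // round-tripping the Vector through a Map.
  def dedupeViaMap(pairs: Vector[(Long, Long)]): Vector[(Long, Long)] =
    pairs.toMap.toVector

  def main(args: Array[String]): Unit = {
    val right = Vector((4L, 3L), (4L, 2L), (2L, 1L))
    // Later pairs overwrite earlier ones, so key 4 keeps the value 2
    // rather than 3. The order of the resulting pairs is not guaranteed.
    println(dedupeViaMap(right))
  }
}
```

Note that this keeps (4,2) rather than (4,3) for the second row of the question. That should be fine if any one of the duplicates is acceptable.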
Upvotes: 1
Reputation: 8996
Here is an implementation using a series of transformations instead of a for-comprehension. The order of the pairs in the result may differ from the input. Will that be an issue?
curRDD.map{ x => (x._1, x._2.groupBy(_._1).mapValues(_.head).values.toVector ) }
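If the order does matter, one possible alternative is a fold that remembers which first elements have already been seen and keeps only the first pair for each. This is a sketch on plain collections, and DedupeByKey and dedupeByFirst are hypothetical names. On Scala 2.13 or later, right.distinctBy(_._1) should give the same result.

```scala
object DedupeByKey {
  // Keep the first pair seen for each first element, preserving input order.
  def dedupeByFirst(pairs: Vector[(Long, Long)]): Vector[(Long, Long)] = {
    val (_, kept) =
      pairs.foldLeft((Set.empty[Long], Vector.empty[(Long, Long)])) {
        case ((seen, acc), pair @ (key, _)) =>
          if (seen(key)) (seen, acc)   // key already kept: drop this pair
          else (seen + key, acc :+ pair) // first time we see this key
      }
    kept
  }

  def main(args: Array[String]): Unit = {
    val right = Vector((3L, 3L), (1L, 1L), (1L, 2L), (2L, 1L), (2L, 2L), (4L, 4L))
    println(dedupeByFirst(right))
  }
}
```

With Spark this would presumably plug in as curRDD.mapValues(DedupeByKey.dedupeByFirst), assuming the usual pair-RDD operations are available on curRDD.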
Upvotes: 2