CRM

Reputation: 1349

Remove duplicate mappings in a Vector of Int pairs in Scala

I have an RDD of type RDD[((Int, Int), Vector[(Long, Long)])]. I need to operate on the Vector part and remove duplicate mappings, that is, pairs that share the same first element. For example, if the Vector contains the two pairs (3,2) and (3,4), only one of them should be kept (either one is fine).

A full-fledged example would be:

((1,1),Vector((3,3), (1,1), (1,2), (2,1), (2,2), (4,4)))
((5,2),Vector((4,3), (4,2), (2,1)))
((5,2),Vector((4,3), (1,2)))

which should become:

((1,1),Vector((3,3), (1,1), (2,1), (4,4)))
((5,2),Vector((4,3), (2,1)))
((5,2),Vector((4,3), (1,2)))

I started with something like this but got stuck:

curRDD.map{ case (left, right) => 
for((ll,li) <- right) yield {

}
}

How do I achieve this?

Upvotes: 3

Views: 99

Answers (2)

The Archetypal Paul

Reputation: 41749

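curRDD.map { case (left, right) => (left, right.toMap.toVector) }

.toMap conveniently keeps only one entry for each key. When a key is repeated, the last pair seen wins, and the order of the pairs in the resulting Vector may differ from the input.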
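
To see what this does to a single Vector, here is a minimal sketch on plain Scala collections, with no Spark involved. The names ToMapDedup and dedupKeepLast are hypothetical and only used for illustration.

```scala
object ToMapDedup {
  // Hypothetical helper: collapses pairs that share a first element by going
  // through a Map. For a repeated key the later pair replaces the earlier one.
  // The order of the returned pairs is not guaranteed to match the input.
  def dedupKeepLast(pairs: Vector[(Long, Long)]): Vector[(Long, Long)] =
    pairs.toMap.toVector
}
```

For Vector((4,3), (4,2), (2,1)) this keeps (4,2) rather than (4,3), which is fine here because the question accepts either one.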

Upvotes: 1

marios

Reputation: 8996

Here is an implementation using a series of transformations instead of a for-comprehension. The order of the pairs in the result can differ from the input. Will that be an issue?

curRDD.map { x => (x._1, x._2.groupBy(_._1).mapValues(_.head).values.toVector) }
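
If the order does matter, one possible alternative is a fold that remembers which keys it has already seen. This is only a sketch on plain collections; OrderedDedup and dedupKeepFirst are hypothetical names, not part of Spark or the standard library.

```scala
object OrderedDedup {
  // Hypothetical helper: keeps the first pair seen for each key and preserves
  // the input order of the pairs that survive.
  def dedupKeepFirst[K, V](pairs: Vector[(K, V)]): Vector[(K, V)] = {
    val (kept, _) = pairs.foldLeft((Vector.empty[(K, V)], Set.empty[K])) {
      case ((acc, seen), pair @ (key, _)) =>
        if (seen(key)) (acc, seen)   // key already kept: drop this pair
        else (acc :+ pair, seen + key) // first time we see this key: keep it
    }
    kept
  }
}
```

Assuming curRDD has the type from the question, it could be applied with something like curRDD.mapValues(v => OrderedDedup.dedupKeepFirst(v)). On Scala 2.13 or later, right.distinctBy(_._1) should give the same result without a helper.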

Upvotes: 2
