Reputation: 15
I have an RDD in the following form:
[ ("a") -> (pos3, pos5), ("b") -> (pos1, pos7), .... ]
and a sequence of all positions:
(pos1, pos2, ..., posn)
Q: How can I map each position to its key? (to get something like the following)
("b", "e", "a", "d", "a" .....)
// "b" correspond to pos 1, "e" correspond to pose 2 and ...
Example (edit):
// chunk of my data
val data = Vector(("a",(124)), ("b",(125)), ("c",(121, 123)), ("d",(122)),..)
val rdd = sc.parallelize(data)
// from rdd I can derive my positions, which look something like:
val positions = Vector(1, 2, 3, 4, ..., 125) // my positions
// I want to map each position to my tokens ("a", "b", "c", ...) to achieve:
Vector("a", "b", "a", ...)
// "a" corresponds to pos1, "b" corresponds to pos2, ...
Upvotes: 1
Views: 44
Reputation: 2518
I'm not sure you have to use Spark for this specific use case (you start with a Vector and end with a Vector containing all your data's characters).
Nevertheless, here's a suggestion if it suits your needs:
val data = Vector(("a",Set(124)), ("b", Set(125)), ("c", Set(121, 123)), ("d", Set(122)))
val rdd = spark.sparkContext.parallelize(data)
val result = rdd.flatMap{case (k,positions) => positions.map(p => Map(p -> k))}
.reduce(_ ++ _) //here, we aggregate the Map objects together, reducing partitions first and then merging executors results
.toVector
.sortBy(_._1) //We sort data based on position
.map(_._2) // We only keep characters
.mkString
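Note that `reduce(_ ++ _)` merges all the single-entry Maps on the driver, so it only scales to modest data sizes. As a minimal sketch, assuming the same `rdd` as above, a variant that keeps the sort distributed could emit plain (position, token) pairs and use Spark's `sortByKey`:

val resultVector = rdd
  .flatMap { case (k, positions) => positions.map(p => (p, k)) } // emit (position, token) pairs
  .sortByKey()   // distributed sort by position
  .values        // keep only the tokens
  .collect()     // gather the ordered tokens on the driver
  .toVector      // Vector("c", "d", "c", "a", "b") for the sample data

Here only the final `collect` brings data back to the driver; the sort itself runs across the cluster.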
Upvotes: 2