Spark - Remove CompactBuffer from group by output (RDD)

Question

Problem statement

I need to format Spark output (remove CompactBuffer) after grouping an RDD.

Input

Header1^Header2
A^4B
A^11A
B^7A
C^6DF
C^7DS

Desired Output

(A,(4B,11A))
(B,(7A))
(C,(6DF,7DS))

What I have tried

val records = sc.textFile("/user/chronicles/test.txt").map(x => {
    // "\\^" escapes ^, which is a regex metacharacter in String.split
    val y = x.split("\\^", -1)
    (y(0).trim(),
     y(1).trim())
    }).groupBy(x => x._1)

records.foreach(println)

Output

 (A,CompactBuffer((4B,11A)))
 (B,CompactBuffer((7A)))
 (C,CompactBuffer((6DF,7DS)))

In my current solution, I can remove "CompactBuffer" by reading each element with foreach and then stripping the word and the extra symbols with replace.

Is there another way to format the data?

Note: I have followed "how to remove compactbuffer in spark output", but mkString didn't work in this case.

eliasah · Accepted Answer

If I understand your question correctly, here you go :

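val data = sc.parallelize(Seq("Header1^Header2", "A^4B", "A^11A", "B^7A", "C^6DF", "C^7DS"))
           .map(x => {
             // "\\^" escapes ^, which is a regex metacharacter in String.split
             val y = x.split("\\^", -1)
             (y(0).trim(), y(1).trim())
           })
           .groupBy(x => x._1)
           // keep only the values of each group and join them as "(v1,v2,...)"
           .mapValues(_.map(_._2).mkString("(", ",", ")"))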

data.collect.foreach(println)
// (A,(4B,11A))
// (B,(7A))
// (C,(6DF,7DS))
// (Header1,(Header2))

To drop the header, you can use a filter. I'm not sure whether that is part of the question; if it is, please comment and I'll update the answer.
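A minimal sketch of the header filter, assuming the header is the first line of the input and that no data line is identical to it. The object name, the `parse` and `render` helpers, the app name and the `local[*]` master are illustrative choices, not part of the original code. It also swaps `groupBy(x => x._1)` for `groupByKey()`, which groups the values directly so the `_.map(_._2)` step is not needed.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GroupWithoutCompactBuffer {
  // Split "key^value" into a (key, value) pair.
  // "\\^" escapes ^, which is a regex metacharacter in String.split.
  def parse(line: String): (String, String) = {
    val fields = line.split("\\^", -1)
    (fields(0).trim, fields(1).trim)
  }

  // Join the grouped values as "(v1,v2,...)" instead of printing the CompactBuffer.
  def render(values: Iterable[String]): String = values.mkString("(", ",", ")")

  def main(args: Array[String]): Unit = {
    // Assumption: a local context for the demo; in spark-shell, use the predefined sc instead.
    val sc = new SparkContext(new SparkConf().setAppName("group-demo").setMaster("local[*]"))
    try {
      val lines = sc.parallelize(
        Seq("Header1^Header2", "A^4B", "A^11A", "B^7A", "C^6DF", "C^7DS"))

      // Assumption: the header is the first record of the RDD.
      val header = lines.first()

      val grouped = lines
        .filter(_ != header) // drop the header line
        .map(parse)
        .groupByKey()        // RDD[(String, Iterable[String])]
        .mapValues(render)   // RDD[(String, String)]

      // sortByKey only makes the printed key order stable;
      // the order of values inside a group is not guaranteed by Spark.
      grouped.sortByKey().collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

When reading from a file with `sc.textFile`, the same `first()` plus `filter` pattern applies, since the header is the first line of the first partition.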
