Debaditya

Reputation: 2497

Spark - Remove CompactBuffer from group by output (RDD)

Problem statement

I need to format Spark output (remove CompactBuffer) after grouping an RDD.

Input

Header1^Header2
A^4B
A^11A
B^7A
C^6DF
C^7DS

Desired Output

(A,(4B,11A))
(B,(7A))
(C,(6DF,7DS))

What I have tried

val records = sc.textFile("/user/chronicles/test.txt").map(x => {
    val y = x.split("\\^",-1)
    (y(0).trim(),
     y(1).trim())
    }).groupBy(x => x._1)

records.foreach(println)

Output

 (A,CompactBuffer((4B,11A)))
 (B,CompactBuffer((7A)))
 (C,CompactBuffer((6DF,7DS)))

In my solution, I can remove "CompactBuffer" by reading each element with foreach and then stripping the word and the extra symbols with replace.

Is there any other way to format the data?

Note: I have followed "how to remove compactbuffer in spark output", but mkString didn't work in this case.

Upvotes: 1

Views: 2259

Answers (1)

eliasah

Reputation: 40370

If I understand your question correctly, here you go:

val data = sc.parallelize(Seq("Header1^Header2", "A^4B", "A^11A", "B^7A", "C^6DF", "C^7DS"))
  .map(x => {
    val y = x.split("\\^", -1)
    (y(0).trim(), y(1).trim())
  })
  .groupBy(x => x._1)
  .mapValues(_.map(_._2).mkString("(", ",", ")"))

data.collect.foreach(println)
// (A,(4B,11A))
// (B,(7A))
// (C,(6DF,7DS))
// (Header1,(Header2))

To drop the header, you can use a filter. I'm not sure whether that is part of the question; if so, please comment so I can correct the answer.
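For illustration, a minimal sketch of what that filter might look like, assuming the header line is exactly "Header1^Header2" (with a real file you could read it via lines.first() instead). The object name DropHeaderExample, the local[*] master, and the variable names are placeholders of mine, not from the question. The sketch also swaps groupBy for groupByKey, which keeps only the values in each group, so the extra _.map(_._2) step is not needed:

```scala
import org.apache.spark.sql.SparkSession

object DropHeaderExample {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; adjust master/appName for a real cluster.
    val spark = SparkSession.builder()
      .appName("drop-header-example")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(
      Seq("Header1^Header2", "A^4B", "A^11A", "B^7A", "C^6DF", "C^7DS"))

    // Assumption: the header text is known in advance.
    val header = "Header1^Header2"

    val grouped = lines
      .filter(_ != header)                      // drop the header row
      .map { line =>
        val fields = line.split("\\^", -1)      // split on a literal '^'
        (fields(0).trim, fields(1).trim)        // (key, value) pair
      }
      .groupByKey()                             // values only, no nested tuples
      .mapValues(_.mkString("(", ",", ")"))     // render each group as "(v1,v2,...)"

    // collect() brings the results to the driver; fine for small data only.
    grouped.collect().sortBy(_._1).foreach(println)

    spark.stop()
  }
}
```

Sorting by key after collect() is only there to make the printed order stable; Spark does not guarantee the order in which groups come back.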

Upvotes: 3
