Reputation: 2497
Problem statement
Need to format Spark output (remove CompactBuffer) after grouping the RDD
Input
Header1^Header2
A^4B
A^11A
B^7A
C^6DF
C^7DS
Desired Output
(A,(4B,11A))
(B,(7A))
(C,(6DF,7DS))
What have I tried
val records = sc.textFIle("/user/chronicles/test.txt").map(x => {
val y = x.split("\\^",-1)
(y(0).trim(),
y(1).trim())
}).groupBy(x => x._1)
records.foreach(println)
Output
(A,CompactBuffer((4B,11A)))
(B,CompactBuffer((7A)))
(C,CompactBuffer((6DF,7DS)))
In my solution, I can remove "CompactBuffer" by reading each element using foreach and then substitute the word and extra symbols using replace command
Is there any other way which can be used to format the data.
Note : I have followed : "how to remove compactbuffer in spark output" - mkString didnt work in this case
Upvotes: 1
Views: 2259
Reputation: 40370
If I understand your question correctly, here you go :
val data = sc.parallelize(Seq("Header1^Header2", "A^4B", "A^11A", "B^7A", "C^6DF", "C^7DS"))
.map(x => {
val y = x.split("\\^", -1)
(y(0).trim(), y(1).trim())
}).groupBy(x => x._1).mapValues(_.map(_._2).mkString("(",",",")"))
data.collect.foreach(println)
// (A,(4B,11A))
// (B,(7A))
// (C,(6DF,7DS))
// (Header1,(Header2))
To drop the header, you can use a filter. I'm not sure if this is the question here. If so, please comment so I can correct it.
Upvotes: 3