Vaibhav Agrawal

Reputation: 120

Spark-Scala RDD

I have an RDD, RDD1, with the following type:

RDD[(String, Array[String])]

I would like to create a new RDD, RDD2, of type RDD[(String, String)], in which each row pairs a key from RDD1 with one of the values in that key's array.

For example:

RDD1 = Array(("Fruit",Array("Orange","Apple","Peach")),("Shape",Array("Square","Rectangle")),("Mathematician",Array("Aryabhatt")))

I want the output to be as:

RDD2 = Array(("Fruit","Orange"),("Fruit","Apple"),("Fruit","Peach"),("Shape","Square"),("Shape","Rectangle"),("Mathematician","Aryabhatt"))

Can someone help me with this piece of code?

My Try:

val R1 = RDD1.map(line => (line._1,line._2.split((","))))
val R2 = R1.map(line => line._2.foreach(ph => ph.map(line._1)))

This gives me an error:

error: value map is not a member of Char

I understand that this is because the map function applies only to RDDs and not to each String/Char. Please help me find a way to use nested functions for this purpose in Spark.

Upvotes: 0

Views: 353

Answers (1)

Arjan

Reputation: 21485

Break down the problem.

  1. Write a function that flattens a single row: ("Fruit",Array("Orange","Apple","Peach")) -> Array(("Fruit", "Orange"), ("Fruit", "Apple"), ("Fruit", "Peach"))

def flattenLine(line: (String, Array[String])) = line._2.map(x => (line._1, x))

  2. Apply that function to your RDD:

RDD1.flatMap(flattenLine)
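
To see what the two steps produce without starting a Spark job, here is a minimal sketch that runs the same logic on an ordinary Scala Seq. The names rdd1Local and rdd2Local are hypothetical stand-ins for RDD1 and RDD2 and are not part of the question; flatMap on a local collection flattens per-row results in the same way, so it should be a reasonable preview of the RDD behavior.

```scala
// Step 1: pair the key of one row with each of its values.
def flattenLine(line: (String, Array[String])): Array[(String, String)] =
  line._2.map(value => (line._1, value))

// Hypothetical local stand-in for RDD1 (a real job would hold this data in an RDD).
val rdd1Local: Seq[(String, Array[String])] = Seq(
  "Fruit" -> Array("Orange", "Apple", "Peach"),
  "Shape" -> Array("Square", "Rectangle"),
  "Mathematician" -> Array("Aryabhatt")
)

// Step 2: flatMap applies flattenLine to every row and concatenates the results.
// .toSeq is only needed here because this sketch uses a plain Seq instead of an RDD.
val rdd2Local: Seq[(String, String)] =
  rdd1Local.flatMap(line => flattenLine(line).toSeq)
```

For this particular shape, where the key is kept and only the values are expanded, flatMapValues on a pair RDD (for example RDD1.flatMapValues(values => values)) is an alternative worth trying; it is not what the answer above uses.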

Upvotes: 4
