user225508

Reputation: 21

Scala RDD[String] to RDD[String,String]

I have an RDD[String] which contains the following data:

Data format: ('Movie Name', 'Actress Name')

('Night of the Demons (2009)  (uncredited)', '"Steff", Stefanie Oxmann Mcgaha')
('The Bad Lieutenant: Port of Call - New Orleans (2009)  (uncredited)', '"Steff", Stefanie Oxmann Mcgaha') 
('"Please Like Me" (2013) {All You Can Eat (#1.4)}', '$haniqua') 
('"Please Like Me" (2013) {French Toast (#1.2)}', '$haniqua') 
('"Please Like Me" (2013) {Horrible Sandwiches (#1.6)}', '$haniqua')

I want to convert this to an RDD[(String, String)], so that the first element within ' ' becomes the first String of each pair and the second element within ' ' becomes the second String.

I tried this:

val rdd1 = sc.textFile("/home/user1/Documents/TestingScala/actress")
val splitRdd = rdd1.map( line => line.split(",") )
splitRdd.foreach(println)

But it gives me what looks like an error:

[Ljava.lang.String;@7741fb9
[Ljava.lang.String;@225f63a5
[Ljava.lang.String;@63640bc4
[Ljava.lang.String;@1354c1de

Upvotes: 1

Views: 2129

Answers (4)

Shankar

Reputation: 8957

Try this to convert the RDD[String] to an RDD[(String, String)]:

val rdd1 = sc.textFile("/home/user1/Documents/TestingScala/actress")
// Split each line once, then pair the first two fields
val splitRdd = rdd1.map(line => line.split(",")).map(fields => (fields(0), fields(1)))

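The map above returns an RDD of key/value tuples, i.e. an RDD[(String, String)]. Note that a plain split(",") also splits on commas inside the quoted fields, such as the one in '"Steff", Stefanie Oxmann Mcgaha'.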

Upvotes: 0

KiranM

Reputation: 1323

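Since this is a CSV-like file whose fields are enclosed in quotes and whose rows are enclosed in parentheses, you need to parse each line with a regular expression. A simple split doesn't work, because the fields themselves can contain commas.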
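
As a rough sketch of the regular-expression approach, assuming every line looks exactly like ('Movie Name', 'Actress Name') with the two fields separated by ', ' (the names LinePattern and parseLine are made up for this illustration):

```scala
// Assumed line layout: ('Movie Name', 'Actress Name'), optionally followed by whitespace.
// The first group is greedy, so the split happens at the last ', ' separator on the line.
val LinePattern = """^\('(.*)', '(.*)'\)\s*$""".r

// Hypothetical helper: returns Some((movie, actress)) for a well-formed line, None otherwise.
def parseLine(line: String): Option[(String, String)] = line match {
  case LinePattern(movie, actress) => Some((movie, actress))
  case _                           => None
}
```

With the RDD from the question, something like rdd1.flatMap(line => parseLine(line)) should then give an RDD[(String, String)] and silently drop lines that do not match. If malformed lines should be reported rather than dropped, the None case would need different handling.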

Upvotes: 0

Kris

Reputation: 1734

It's not an error. We could also use flatMap() here to avoid confusion:

val rdd1 = sc.textFile("/home/user1/Documents/TestingScala/actress")
rdd1.flatMap( line => line.split(",")).foreach(println)

Here, the function passed to map returns a single element (an array) per line, while the function passed to flatMap returns a collection of zero or more elements. The output of flatMap is then flattened into one sequence.
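
The difference can be seen on a small local List, which behaves the same way as an RDD for map and flatMap. This is only an illustration with made-up sample lines, not the data from the question:

```scala
// Made-up sample lines for illustration
val lines = List("a,b", "c,d,e")

// map: one output element per input line, and each element is an Array[String]
val mapped: List[Array[String]] = lines.map(line => line.split(","))

// flatMap: the per-line arrays are flattened into a single list of strings
val flattened: List[String] = lines.flatMap(line => line.split(","))
```

Note that flatMap discards the information about which pieces came from the same line, so it makes the output printable but does not produce the pairs the question asks for.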

Upvotes: 0

Pawan B

Reputation: 4623

[Ljava.lang.String;@7741fb9 is not an error. This is what is printed when you try to print an array.

[ - a single-dimensional array

L - the array contains a class or interface

java.lang.String - the type of objects in the array

@ - separates the type from the hash code

7741fb9 - the hash code of the object, in hexadecimal

To print a String array, you can try this code:

import scala.runtime.ScalaRunTime._
splitRdd.foreach(array => println(stringOf(array)))
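
As an alternative sketch that avoids the ScalaRunTime import, mkString from the standard library also turns an array into a readable string. The sample value below is made up for illustration:

```scala
// Made-up sample array for illustration
val parts: Array[String] = "a,b".split(",")

// Default toString of a JVM array: type descriptor plus hash code, not the contents
val raw: String = parts.toString

// mkString joins the elements using the given prefix, separator and suffix
val readable: String = parts.mkString("[", ", ", "]")
```

Applied to the RDD from the question, this would look like splitRdd.foreach(array => println(array.mkString(", "))).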


Upvotes: 5
