Ankita

Reputation: 490

Scala: read a text file and create tuples from it

How can I create tuples from the RDD below?

// reading a text file "b.txt" and creating RDD 
val rdd = sc.textFile("/home/training/desktop/b.txt") 

Contents of b.txt:

 Ankita,26,BigData,newbie
 Shikha,30,Management,Expert

Upvotes: 0

Views: 2524

Answers (1)

Ramesh Maharjan

Reputation: 41987

If you want an Array[Tuple4], i.e. an Array[(String, String, String, String)], you can do the following:

scala> val rdd = sc.textFile("file:/home/training/desktop/b.txt")
rdd: org.apache.spark.rdd.RDD[String] = file:/home/training/desktop/b.txt MapPartitionsRDD[5] at textFile at <console>:24

scala> val arrayTuples = rdd.map(line => line.split(",")).map(array => (array(0), array(1), array(2), array(3))).collect
arrayTuples: Array[(String, String, String, String)] = Array((" Ankita",26,BigData,newbie), (" Shikha",30,Management,Expert))

You can then access each field of a tuple by position:

scala> arrayTuples.map(x => println(x._3))
BigData
Management
res4: Array[Unit] = Array((), ())
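
The sample lines in the question start with a space, which is why the first field comes out as " Ankita". A minimal sketch of the same split with each field trimmed is shown below. It uses a plain Scala Seq so it can be tried without Spark, and the names TrimSketch and toTuple4 are hypothetical. It also prints with foreach instead of map, which avoids building the Array[Unit] seen above.

```scala
// Hypothetical helper, not part of the original answer.
object TrimSketch {
  // Split on commas, trim every field, and build a 4-tuple.
  // Assumes every line has at least four fields.
  def toTuple4(line: String): (String, String, String, String) = {
    val f = line.split(",").map(_.trim)
    (f(0), f(1), f(2), f(3))
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(" Ankita,26,BigData,newbie", " Shikha,30,Management,Expert")
    val tuples = lines.map(toTuple4)
    // Prints BigData, then Management
    tuples.foreach(t => println(t._3))
  }
}
```

With an RDD, the same function could presumably be used as rdd.map(TrimSketch.toTuple4).collect.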

Updated

If your input file has a variable number of fields per line, such as

Ankita,26,BigData,newbie
Shikha,30,Management,Expert
Anita,26,big

you can use pattern matching:

scala> val arrayTuples = rdd.map(line => line.split(",") match {
     | case Array(a, b, c, d) => (a,b,c,d)
     | case Array(a,b,c) => (a,b,c)
     | }).collect
arrayTuples: Array[Product with Serializable] = Array((Ankita,26,BigData,newbie), (Shikha,30,Management,Expert), (Anita,26,big))
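
Note that the match above has no case for lines with any other number of fields, so such a line would throw a scala.MatchError. One possible alternative, sketched here under the assumption that malformed lines may simply be dropped, is to return an Option and keep a single tuple type. The names StrictParse and parse are hypothetical, and the example uses a plain Seq so it runs without Spark.

```scala
// Hypothetical sketch: keep only lines with exactly four fields.
object StrictParse {
  def parse(line: String): Option[(String, String, String, String)] =
    line.split(",").map(_.trim) match {
      case Array(a, b, c, d) => Some((a, b, c, d))
      case _                 => None // any other field count is skipped
    }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "Ankita,26,BigData,newbie",
      "Shikha,30,Management,Expert",
      "Anita,26,big"
    )
    // flatMap drops the None results, so only the two complete rows remain
    val tuples = lines.flatMap(line => parse(line))
    tuples.foreach(println)
  }
}
```

The result stays typed as a collection of (String, String, String, String), so fields remain accessible as _1 to _4. The trade-off is that the short row for Anita is discarded rather than kept.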

Updated again

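As @eliasah pointed out, the approach above is bad practice: the tuples have different arities, so the result is typed as Array[Product with Serializable] and the fields can only be reached through the product iterator. Following his suggestion, we should know the maximum number of fields in the input data and assign default values to the missing ones: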

import scala.util.Try

val arrayTuples = rdd.map(line => line.split(",")).map(array => (Try(array(0)) getOrElse("Empty"), Try(array(1)) getOrElse(0), Try(array(2)) getOrElse("Empty"), Try(array(3)) getOrElse("Empty"))).collect

And as @philantrovert pointed out, outside the REPL we can verify the output with

arrayTuples.foreach(println)

which prints

(Ankita,26,BigData,newbie)
(Shikha,30,Management,Expert)
(Anita,26,big,Empty)
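
In the version above, the second element mixes a String field with an Int default, so it is inferred as Any. A minimal sketch of a variant that pads missing fields and parses the age as an Int is shown below. It assumes at most four fields per line and that the second field is meant to be numeric; the names DefaultsSketch and parse are hypothetical, and the example runs on plain strings without Spark.

```scala
import scala.util.Try

// Hypothetical sketch: fill missing trailing fields with defaults
// and give every element of the tuple a concrete type.
object DefaultsSketch {
  def parse(line: String): (String, Int, String, String) = {
    // padTo appends "Empty" until the array has four elements
    val f = line.split(",").map(_.trim).padTo(4, "Empty")
    // A missing or non-numeric age falls back to 0
    (f(0), Try(f(1).toInt).getOrElse(0), f(2), f(3))
  }

  def main(args: Array[String]): Unit = {
    // Prints (Anita,26,big,Empty)
    println(parse("Anita,26,big"))
  }
}
```

Because the result type is (String, Int, String, String), the fields can be used directly, for example parse(line)._2 + 1, without casting.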

Upvotes: 5
