icecream

Reputation: 23

How to convert each element of an array to an array of arrays in Spark

Given a dataset with multiple lines:

0,1,2

7,8,9

18,19,5

How can I produce the following result in Spark:

Array(Array(Array(0),Array(1),Array(2)), Array(Array(7),Array(8),Array(9)), Array(Array(18),Array(19),Array(5)))

Upvotes: 0

Views: 791

Answers (1)

Ramesh Maharjan

Reputation: 41987

If you are talking about an RDD[Array[Array[Int]]] in Spark, which is equivalent to Array[Array[Array[Int]]] in Scala, then you can do the following.

Supposing you have a text file (/home/test.csv) containing

0,1,2
7,8,9
18,19,5

you can do:

scala> val data = sc.textFile("/home/test.csv")
data: org.apache.spark.rdd.RDD[String] = /home/test.csv MapPartitionsRDD[4] at textFile at <console>:24

scala> val array = data.map(line => line.split(",").map(x => Array(x.toInt)))
array: org.apache.spark.rdd.RDD[Array[Array[Int]]] = MapPartitionsRDD[5] at map at <console>:26
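
To verify that the structure matches the output asked for in the question, you can collect the RDD back to the driver (safe here only because the file is tiny; the res numbering and RDD ids below are from a sample session and will vary):

scala> array.collect()
res0: Array[Array[Array[Int]]] = Array(Array(Array(0), Array(1), Array(2)), Array(Array(7), Array(8), Array(9)), Array(Array(18), Array(19), Array(5)))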

You can take it one step further to get an RDD[Array[Array[Array[Int]]]], where each value of the RDD is the full type you want. For that you can use wholeTextFiles, which reads each file into a Tuple2 of (filename, file contents):

scala> val data = sc.wholeTextFiles("/home/test.csv")
data: org.apache.spark.rdd.RDD[(String, String)] = /home/test.csv MapPartitionsRDD[3] at wholeTextFiles at <console>:24

scala> val array = data.map(t2 => t2._2.split("\n").map(line => line.split(",").map(x => Array(x.toInt))))
array: org.apache.spark.rdd.RDD[Array[Array[Array[Int]]]] = MapPartitionsRDD[4] at map at <console>:26
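
Here each RDD element covers a whole file, so first() returns the complete nested array. One caveat worth noting (my addition, not part of the original answer): if the file ends with a trailing newline, split("\n") produces a trailing empty string and toInt throws a NumberFormatException, so filtering out empty lines is a safer variation:

scala> val array = data.map(t2 => t2._2.split("\n").filter(_.nonEmpty).map(line => line.split(",").map(x => Array(x.toInt))))
array: org.apache.spark.rdd.RDD[Array[Array[Array[Int]]]] = MapPartitionsRDD[5] at map at <console>:26

scala> array.first()
res1: Array[Array[Array[Int]]] = Array(Array(Array(0), Array(1), Array(2)), Array(Array(7), Array(8), Array(9)), Array(Array(18), Array(19), Array(5)))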

Upvotes: 1
