Fahad Ibrar

Reputation: 107

Converting an RDD into a DataFrame in Spark

How can I convert an RDD with the following structure into a DataFrame in Scala?

org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)] = MapPartitionsRDD[42]

Here each row of the RDD contains an index (a Long) and a vector (org.apache.spark.mllib.linalg.Vector).

I want each component of the org.apache.spark.mllib.linalg.Vector to end up in a separate column of the DataFrame row.
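For reference, an RDD of this shape can be built like this (a minimal sketch, assuming a spark-shell session where sc is in scope; the values are made up):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Each row pairs a Long index with an mllib dense vector
val rdd: org.apache.spark.rdd.RDD[(Long, Vector)] = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 3.0)),
  (1L, Vectors.dense(4.0, 5.0, 6.0))
))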

Upvotes: 2

Views: 346

Answers (1)

rogue-one

Reputation: 11587

The example below works. For brevity I have assumed a vector size of 10; you should be able to scale it to 1000.

import org.apache.spark.mllib.linalg.Vectors
import spark.implicits._  // for toDF and the $"..." column syntax; already in scope in spark-shell

val rdd = sc.parallelize(Seq((1L, Vectors.dense((1 to 10).map(_ * 1.0).toArray))))
// Turn the vector into an array column, then split it into one column per element
val df = rdd.map { case (a, b) => (a, b.toArray) }.toDF("c1", "c2")
df.select($"c1" +: (0 to 9).map(idx => $"c2"(idx) as s"c${idx + 2}"): _*).show()
+---+---+---+---+---+---+---+---+---+---+----+
| c1| c2| c3| c4| c5| c6| c7| c8| c9|c10| c11|
+---+---+---+---+---+---+---+---+---+---+----+
|  1|1.0|2.0|3.0|4.0|5.0|6.0|7.0|8.0|9.0|10.0|
+---+---+---+---+---+---+---+---+---+---+----+
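To avoid hard-coding the width when scaling to 1000 components, you can derive the number of columns from the data itself. A sketch, assuming every vector in the RDD has the same length as the first one:

// Assumption: all vectors share the length of the first row's vector
val width = rdd.first()._2.size
df.select($"c1" +: (0 until width).map(idx => $"c2"(idx) as s"c${idx + 2}"): _*).show()

This only changes how the column count is obtained; the select itself is the same as above.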

Upvotes: 1
