mt88

Reputation: 3015

Spark Scala: Vector Dataframe to RDD of values

I have a spark dataframe that has a vector in it:

org.apache.spark.sql.DataFrame = [sF: vector]

and I'm trying to convert it to an RDD of values:

org.apache.spark.rdd.RDD[(Double, Double)] 

However, I haven't been able to convert it properly. I've tried:

val m2 = m1.select($"sF").rdd.map{case Row(v1, v2) => (v1.toString.toDouble, v2.toString.toDouble)}

and it compiles, but I get a runtime error:

scala.MatchError: [[-0.1111111111111111,-0.2222222222222222]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema) 

when I do:

m2.take(10).foreach(println)

Is there something I'm doing wrong?

Upvotes: 1

Views: 1041

Answers (1)

Daniel de Paula

Reputation: 17872

Assuming you want the first two values of the vectors present in the sF column, maybe this will work:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

val m2 = m1
  .select($"sF")
  .map { case Row(v: Vector) => (v(0), v(1)) } // each row holds a single Vector

You are getting an error because case Row(v1, v2) does not match the contents of the rows in your DataFrame: it expects two values per row (v1 and v2), but each row contains only one value, a Vector.
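To see the mismatch concretely, here is a minimal sketch (the values are just illustrative):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row

// A row of m1 has ONE field, and that single field holds the whole Vector
val row = Row(Vectors.dense(-0.1111111111111111, -0.2222222222222222))

row match {
  case Row(v1, v2)    => s"two fields: $v1, $v2"  // never matches: the row has one field
  case Row(v: Vector) => s"one Vector field: $v"  // matches
}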

Note: you don't need to call .rdd if you are going to do a .map operation.
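For completeness, a self-contained sketch of the whole round trip, assuming a Spark 1.x SQLContext named sqlContext is in scope (under the Spark 2.x Dataset API, map would need an Encoder, so there you would keep the .rdd call):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
import sqlContext.implicits._ // for the $"..." column syntax

// Illustrative one-column DataFrame matching the question's schema: [sF: vector]
val m1 = sqlContext.createDataFrame(
  Seq(Tuple1(Vectors.dense(-0.1111111111111111, -0.2222222222222222)))
).toDF("sF")

// Each row holds a single Vector; take its first two components
val m2: org.apache.spark.rdd.RDD[(Double, Double)] =
  m1.select($"sF").map { case Row(v: Vector) => (v(0), v(1)) }

m2.take(10).foreach(println)
// (-0.1111111111111111,-0.2222222222222222)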

Upvotes: 3
