mt88

Reputation: 3015

Spark Scala: Vector Dataframe to RDD of values

I have a spark dataframe that has a vector in it:

org.apache.spark.sql.DataFrame = [sF: vector]

and I'm trying to convert it to an RDD of values:

org.apache.spark.rdd.RDD[(Double, Double)] 

However, I haven't been able to convert it properly. I've tried:

val m2 = m1.select($"sF").rdd.map{case Row(v1, v2) => (v1.toString.toDouble, v2.toString.toDouble)}

and it compiles, but I get a runtime error:

scala.MatchError: [[-0.1111111111111111,-0.2222222222222222]] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema) 

when I do:

m2.take(10).foreach(println)

Is there something I'm doing wrong?

Upvotes: 1

Views: 1041

Answers (1)

Daniel de Paula

Reputation: 17872

Assuming you want the first two values of the vectors present in the sF column, maybe this will work:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.Row

val m2 = m1
  .select($"sF")
  .map { case Row(v: Vector) => (v(0), v(1)) } // each row holds a single Vector

You are getting an error because case Row(v1, v2) does not match the contents of the rows in your DataFrame: it expects two values per row (v1 and v2), but each row contains only one value, a Vector.
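To see the mismatch concretely, here is a minimal sketch (the values are just illustrative):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row

// A row of m1 has ONE field, and that single field holds the whole Vector
val row = Row(Vectors.dense(-0.1111111111111111, -0.2222222222222222))

row match {
  case Row(v1, v2)    => s"two fields: $v1, $v2"  // never matches: the row has one field
  case Row(v: Vector) => s"one Vector field: $v"  // matches
}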

Note: you don't need to call .rdd if you are going to do a .map operation.
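For completeness, a self-contained sketch of the whole round trip, assuming a Spark 1.x SQLContext named sqlContext is in scope (under the Spark 2.x Dataset API, map would need an Encoder, so there you would keep the .rdd call):

import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
import sqlContext.implicits._ // for the $"..." column syntax

// Illustrative one-column DataFrame matching the question's schema: [sF: vector]
val m1 = sqlContext.createDataFrame(
  Seq(Tuple1(Vectors.dense(-0.1111111111111111, -0.2222222222222222)))
).toDF("sF")

// Each row holds a single Vector; take its first two components
val m2: org.apache.spark.rdd.RDD[(Double, Double)] =
  m1.select($"sF").map { case Row(v: Vector) => (v(0), v(1)) }

m2.take(10).foreach(println)
// (-0.1111111111111111,-0.2222222222222222)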

Upvotes: 3
