Reputation: 799
I'm trying to create an array of all the features in a features Vector in Apache Spark and Scala. I need to do this in order to create a Breeze Matrix of the features for various computations in my algorithm. Currently the features are wrapped in a features vector and I want to extract each of them separately. I've been looking at the following question: Applying IndexToString to features vector in Spark
Here's my current code: (data is a Spark DataFrame, all features are Doubles)
val featureCols = Array("feature1", "feature2", "feature3")
val featureAssembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val dataWithFeatures = featureAssembler.transform(data)
//now we slice the features back again
val featureSlicer = featureCols.map {
col => new VectorSlicer().setInputCol("features").setOutputCol(s"${col}_sliced").setNames(Array(s"${col}"))}
val output = featureSlicer.map(f => f.transform(dataWithFeatures).select(f.getOutputCol).as[Double].collect)
val array = output.flatten.toArray
However, this fails with the following error: 'cannot resolve CAST("feature1" AS DOUBLE due to data type mismatch - cannot cast VectorUDT to DoubleType'
This seems odd since I can do the following without an error:
val array: Array[Double] = dataWithFeatures.select("feature1").as[Double].collect()
Any ideas on how to fix this? Is there a better way? It seems inefficient to create a sequence of DataFrames and perform the operation on each one separately.
Thanks!
Upvotes: 2
Views: 1237
Reputation: 214957
Say the features column is the vector column assembled from all the other feature columns. You can select the features column, convert it to an rdd, and then flatMap it:
Example data:
dataWithFeatures.show
+--------+--------+--------+-------------+
|feature1|feature2|feature3|     features|
+--------+--------+--------+-------------+
|       1|       2|       3|[1.0,2.0,3.0]|
|       4|       5|       6|[4.0,5.0,6.0]|
+--------+--------+--------+-------------+
import org.apache.spark.ml.linalg.Vector
dataWithFeatures.select("features").rdd.flatMap(r => r.getAs[Vector](0).toArray).collect
// res19: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
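Since the goal in the question is a Breeze matrix, here is a minimal sketch of one way to reshape that flat array, assuming Breeze is on the classpath. The names flat, numFeatures, numRows and featureMatrix are illustrative and not from the original code, and the array literal is copied from the output above so the snippet runs without Spark.

```scala
import breeze.linalg.DenseMatrix

// Flattened feature values, one record after another, as returned by the
// flatMap above (hard-coded here so the sketch does not need a SparkSession).
val flat: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)

val numFeatures = 3                      // assumption: featureCols.length from the question
val numRows = flat.length / numFeatures  // number of records in the DataFrame

// Breeze stores DenseMatrix data column by column, so build a
// numFeatures x numRows matrix first, then transpose it to get
// one row per record and one column per feature.
val featureMatrix: DenseMatrix[Double] =
  new DenseMatrix[Double](numFeatures, numRows, flat).t
```

With this layout, featureMatrix(i, j) is feature j of record i. Note that collect pulls every value to the driver, so this only suits data that fits in driver memory.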
Upvotes: 0