LucieCBurgess

Reputation: 799

Create an Array from features vector in Apache Spark/ Scala

I'm trying to create an array of all the features in a features Vector in Apache Spark and Scala. I need to do this in order to create a Breeze Matrix of the features for various computations in my algorithm. Currently the features are wrapped in a features vector and I want to extract each of them separately. I've been looking at the following question: Applying IndexToString to features vector in Spark

Here's my current code: (data is a Spark DataFrame, all features are Doubles)

val featureCols = Array("feature1", "feature2", "feature3") 
val featureAssembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val dataWithFeatures = featureAssembler.transform(data)

//now we slice the features back again
val featureSlicer = featureCols.map {
  col => new VectorSlicer().setInputCol("features").setOutputCol(s"${col}_sliced").setNames(Array(s"${col}"))}
val output = featureSlicer.map(f => f.transform(dataWithFeatures).select(f.getOutputCol).as[Double].collect)
val array = output.flatten.toArray

However this fails with the following error: 'cannot resolve CAST("feature1" AS DOUBLE) due to data type mismatch - cannot cast VectorUDT to DoubleType'

This seems odd since I can do the following without an error:

val array: Array[Double] = dataWithFeatures.select("feature1").as[Double].collect()

Any ideas how to fix this, or is there a better way? It seems inefficient to create a sequence of DataFrames and perform the operation on each one separately.

Thanks!

Upvotes: 2

Views: 1237

Answers (1)

akuiper

Reputation: 214957

Say the features column is the vector column assembled from all the other feature columns. You can select the features column, convert it to an RDD, and then flatMap it:

Example data:

dataWithFeatures.show
+--------+--------+--------+-------------+
|feature1|feature2|feature3|     features|
+--------+--------+--------+-------------+
|       1|       2|       3|[1.0,2.0,3.0]|
|       4|       5|       6|[4.0,5.0,6.0]|
+--------+--------+--------+-------------+

import org.apache.spark.ml.linalg.Vector

dataWithFeatures.select("features").rdd.flatMap(r => r.getAs[Vector](0).toArray).collect
// res19: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
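Since the stated goal is a Breeze matrix, the collected values can be packed into one directly. A minimal sketch, assuming everything fits in driver memory and all feature vectors have the same length (note that Breeze's `new DenseMatrix(rows, cols, data)` constructor expects column-major data, so here each row is collected as its own array and the matrix is built row-wise via the companion object's varargs `apply`):

```scala
import breeze.linalg.DenseMatrix
import org.apache.spark.ml.linalg.Vector

// One Array[Double] per DataFrame row, in row order.
val rowArrays: Array[Array[Double]] =
  dataWithFeatures.select("features").rdd
    .map(r => r.getAs[Vector](0).toArray)
    .collect()

// Build a matrix with one row per feature vector.
val matrix: DenseMatrix[Double] = DenseMatrix(rowArrays: _*)
```

This avoids worrying about the memory layout of the flattened array; if you already have the flat row-major `Array[Double]` from the `flatMap` above, you would instead need to account for the column-major ordering when constructing the matrix.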

Upvotes: 0
