Leothorn

Reputation: 1345

Using VectorAssembler and extracting "features" as org.apache.spark.mllib.linalg.Vectors in Spark (Scala)

I want to use the Gaussian Mixture Model in Spark 1.5.1, which expects an RDD of org.apache.spark.mllib.linalg.Vector.

This is my code:

import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.clustering.GaussianMixtureModel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.DataFrameNaFunctions
// dummy is a previously loaded DataFrame (declared earlier as a var)
dummy = dummy.na.drop()
var colnames = dummy.columns
var df = dummy

// drop String and Long columns so only numeric feature columns remain
for (x <- colnames) {
  if (dummy.select(x).dtypes(0)._2.equals("StringType") || dummy.select(x).dtypes(0)._2.equals("LongType")) {
    df = df.drop(x)
  }
}

colnames = df.columns
var assembler = new VectorAssembler().setInputCols(colnames).setOutputCol("features")
var output = assembler.transform(df)
var temp = output.select("features")
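
As a side note, the column-dropping loop above can also be written as a single select. This is just a sketch against the same dummy DataFrame; numericCols and dfNumeric are hypothetical names, not part of the original code:

import org.apache.spark.sql.functions.col

// keep only columns whose type is neither StringType nor LongType
val numericCols = dummy.dtypes.collect {
  case (name, dtype) if dtype != "StringType" && dtype != "LongType" => name
}
val dfNumeric = dummy.select(numericCols.map(col): _*)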

The problem is that I am not able to convert the "features" column into an RDD of org.apache.spark.mllib.linalg.Vector.

Does anyone have an idea how to do this?

Upvotes: 2

Views: 2658

Answers (1)

zero323

Reputation: 330073

Spark >= 2.0

Either map:

temp.rdd.map(_.getAs[org.apache.spark.ml.linalg.Vector]("features"))

or use as:

temp
  .select("features")
  .as[Tuple1[org.apache.spark.ml.linalg.Vector]]
  .rdd.map(_._1)
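
In Spark >= 2.0 the assembler output holds the new org.apache.spark.ml.linalg.Vector type, so if the goal is still the RDD-based (mllib) GaussianMixture from the question, the extracted vectors have to be converted back to the old type. A minimal sketch, assuming temp is the assembler output as above (mllibRdd is just an illustrative name):

import org.apache.spark.ml.linalg.{Vector => NewVector}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

// extract the ml vectors, then convert each one to an mllib vector
val mllibRdd = temp.rdd
  .map(_.getAs[NewVector]("features"))
  .map(OldVectors.fromML)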

Spark < 2.0

Just map over RDD[Row] and extract the field:

temp.rdd.map(_.getAs[org.apache.spark.mllib.linalg.Vector]("features"))
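
In either case, once you have an RDD[org.apache.spark.mllib.linalg.Vector] you can pass it straight to the clustering step from the question. A minimal sketch (k = 2 is an arbitrary placeholder, not something taken from the question):

import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vector

// cache the vectors, since GaussianMixture is iterative
val vectors = temp.rdd.map(_.getAs[Vector]("features")).cache()
val gmm = new GaussianMixture().setK(2).run(vectors)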

Upvotes: 2
