Juan Luis Sanchez

Reputation: 5

Extracting feature columns results in (numberOfFeatures, Array[nonZeroFeatIndexes], Array[nonZeroFeatValues]) instead of an array of all the column values

I'm using Spark MLlib with Scala to load a CSV file and transform the features into a feature vector that I can use to train some models. For that, I'm using the following code:

// Imports needed by the snippet below
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.col
import spark.implicits._

// Loading the data
val rawData = spark.read.option("header", "true").csv(data)      // id, feat0, feat1, feat2,...
val rawLabels = spark.read.option("header", "true").csv(labels)  // id, label
val rawTrainingDataSet = rawData.join(rawLabels, "id")

// Set features columns
val featureCols = rawTrainingDataSet.columns.drop(1) // drop the id column
    
// The CSV columns are read as String, so cast them to Double
val exprs = featureCols.map(c => col(c).cast("Double"))
  
// Assembler taking a sample of just 6 columns; the real case should pass "featureCols" to "setInputCols"
val assembler = new VectorAssembler()
  .setInputCols(Array("feat0", "feat1", "feat2", "feat3", "feat4", "feat5"))
  .setOutputCol("features")

// Select all the column values to create the "features" column with them
val result = assembler.transform(rawTrainingDataSet.select(exprs: _*)).select($"features")
result.show(5,false)

This works, but I'm not getting the expected result for the features column as shown in the documentation (https://spark.apache.org/docs/2.4.4/ml-features.html#vectorassembler); instead, I'm getting this:

feat0|feat1|feat2|feat3|feat4|feat5| features
39.0 |0.0  |  1.0|  0.0|  0.0|  1.0| [39.0,0.0,1.0,1.0,0.0,0.0]
29.0 |0.0  |  1.0|  0.0|  0.0|  0.0| (6,[0,2],[29.0,1.0])
53.0 |1.0  |  0.0|  0.0|  0.0|  0.0| (6,[0,1],[53.0,1.0])
31.0 |0.0  |  1.0|  0.0|  0.0|  1.0| (6,[0,2,5],[31.0,1.0,1.0])
37.0 |0.0  |  1.0|  0.0|  0.0|  0.0| (6,[0,2],[37.0,1.0])

As you can see, for the features column I am getting (number_of_features, [indexes_of_non_zero_features], [values_of_non_zero_features]). Only the first row shows the expected value, which is what I would like to have for all the DataFrame rows: an array with all the column values, no matter whether they are zero. Could you give me any hints about what I am doing wrong?
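To double-check that no values were being lost, I compared the second row's output against the dense vector I built by hand (a minimal check, using org.apache.spark.ml.linalg.Vectors, the vector type VectorAssembler produces):

import org.apache.spark.ml.linalg.Vectors

// (6,[0,2],[29.0,1.0]) reads as: size 6, indices 0 and 2 hold 29.0 and 1.0,
// and every other position is 0.0
val sparse = Vectors.sparse(6, Array(0, 2), Array(29.0, 1.0))
val dense = Vectors.dense(29.0, 0.0, 1.0, 0.0, 0.0, 0.0)
println(sparse == dense)  // prints true: equality compares values, not storage layout

So the values seem to be all there, but the display format is not what I want.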

Thank you!!

Upvotes: 0

Views: 27

Answers (1)

Som

Reputation: 6338

Convert the sparse vector to a dense one, as below:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// UDF that copies a (possibly sparse) vector into its dense form,
// materializing every element, zeros included
val sparseToDense = udf((v: Vector) => v.toDense)

result.withColumn("features_dense", sparseToDense(col("features")))
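If you are on Spark 3.0 or later, a built-in alternative is vector_to_array from org.apache.spark.ml.functions, which converts the vector column into a plain array column with every element written out, zeros included. A minimal sketch, assuming the same result DataFrame as above:

import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.col

// Works on both sparse and dense vectors; always returns a
// full-length array of doubles
result.withColumn("features_arr", vector_to_array(col("features"))).show(5, false)

Either way, note that the sparse and dense forms encode exactly the same values; the conversion only changes how they are stored and displayed.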
   

Upvotes: 0
