Declare StructType of a Dataframe: column containing org.apache.spark.ml.linalg.Vector

Question

I have a DataFrame called df1 with the following scheme:

root
 |-- instances: string (nullable = true)
 |-- features: vector (nullable = true)
 |-- label: double (nullable = false)

Where features and label were obtained from a LabeledPoint. I would like to generate a new one DataFrame but modifying the content of the instances and the features. In order to do so, I write the following:

val schema2 = new StructType()
  .add("instances", "string")
  .add("features", "org.apache.spark.ml.linalg.Vector")  // also I've tried using `vector`
  .add("label", "double")

val schemaEncoder = RowEncoder(schema2)

val df2 = df1.map {
  case Row(inst: String, lp: org.apache.spark.ml.linalg.Vector, lbl: Double) => {
    val nInst = modifyInstances(inst)
    val nLP = nInst.split(",")
    Row(nInst, nLP, lbl)
  }
}(schemaEncoder)

When I run the code, the problem would be in .add("features", "org.apache.spark.ml.linalg.Vector")

Any suggestion?

Leo C · Accepted Answer

You would need to specify its DataType as org.apache.spark.ml.linalg.SQLDataTypes.VectorType, like below:

import org.apache.spark.sql.types._
import org.apache.spark.ml.linalg.SQLDataTypes._

val schema2 = new StructType().
  add("instances", StringType).
  add("features", VectorType).
  add("label", DoubleType)
// schema2: org.apache.spark.sql.types.StructType = StructType(
//   StructField(instances,StringType,true),
//   StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true),
//   StructField(label,DoubleType,true)
// )

Declare StructType of a Dataframe: column containing org.apache.spark.ml.linalg.Vector

Answers (1)

Related Questions