Reputation: 742
I have a DataFrame called df1
with the following scheme:
root
|-- instances: string (nullable = true)
|-- features: vector (nullable = true)
|-- label: double (nullable = false)
Where features
and label
were obtained from a LabeledPoint
.
I would like to generate a new one DataFrame but modifying the content of the instances
and the features
.
In order to do so, I write the following:
val schema2 = new StructType()
.add("instances", "string")
.add("features", "org.apache.spark.ml.linalg.Vector") // also I've tried using `vector`
.add("label", "double")
val schemaEncoder = RowEncoder(schema2)
val df2 = df1.map {
case Row(inst: String, lp: org.apache.spark.ml.linalg.Vector, lbl: Double) => {
val nInst = modifyInstances(inst)
val nLP = nInst.split(",")
Row(nInst, nLP, lbl)
}
}(schemaEncoder)
When I run the code, the problem would be in .add("features", "org.apache.spark.ml.linalg.Vector")
Any suggestion?
Upvotes: 0
Views: 385
Reputation: 22439
You would need to specify its DataType
as org.apache.spark.ml.linalg.SQLDataTypes.VectorType
, like below:
import org.apache.spark.sql.types._
import org.apache.spark.ml.linalg.SQLDataTypes._
val schema2 = new StructType().
add("instances", StringType).
add("features", VectorType).
add("label", DoubleType)
// schema2: org.apache.spark.sql.types.StructType = StructType(
// StructField(instances,StringType,true),
// StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true),
// StructField(label,DoubleType,true)
// )
Upvotes: 1