Powers
Powers

Reputation: 19328

Creating a Spark Vector Column with createDataFrame

I can make a Spark DataFrame with a vector column with the toDF method.

val dataset = Seq((1.0, org.apache.spark.ml.linalg.Vectors.dense(0.0, 10.0, 0.5))).toDF("id", "userFeatures")

scala> dataset.printSchema()
root
 |-- id: double (nullable = false)
 |-- userFeatures: vector (nullable = true)


scala> dataset.schema
res5: org.apache.spark.sql.types.StructType = StructType(StructField(id,DoubleType,false), StructField(userFeatures,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))

I'm not sure how to create a vector column with the createDataFrame method. There isn't a VectorType type in org.apache.spark.sql.types.

This doesn't work:

val rows = spark.sparkContext.parallelize(
  List(
    Row(1.0, Vectors.dense(1.0, 2.0))
  )
)

val schema = List(
  StructField("id", DoubleType, true),
  StructField("features", new org.apache.spark.ml.linalg.VectorUDT, true)
)

val df = spark.createDataFrame(
  rows,
  StructType(schema)
)

df.show()
df.printSchema()

Upvotes: 1

Views: 3971

Answers (1)

himanshuIIITian
himanshuIIITian

Reputation: 6085

To create a Spark Vector Column with createDataFrame, you can use following code:

val rows = spark.sparkContext.parallelize(
  List(
    Row(1.0, org.apache.spark.mllib.linalg.Vectors.dense(1.0, 2.0))
  )
)

val schema = List(
  StructField("id", DoubleType, true),
  StructField("features", new org.apache.spark.mllib.linalg.VectorUDT, true)
)

val df = spark.createDataFrame(
  rows,
  StructType(schema)
)

df.show()
+---+---------+
| id| features|
+---+---------+
|1.0|[1.0,2.0]|
+---+---------+

df.printSchema()
root
 |-- id: double (nullable = true)
 |-- features: vector (nullable = true)

The actual issue was incompatible type org.apache.spark.ml.linalg.Vectors.dense which is not a valid external type for schema of vector. So, we have to switch to mllib package instead of ml package.

I hope it helps!

Note: I am using Spark v2.3.0. Also, class VectorUDT in package linalg cannot be accessed in package org.apache.spark.ml.linalg.

For reference - https://github.com/apache/spark/tree/master/mllib/src/main/scala/org/apache/spark/mllib

Upvotes: 2

Related Questions