Zyoma
Zyoma

Reputation: 1578

Can't use Vector from Spark ML Lib for the DataFrame

When I'm trying to use UDF that returns the Vector object, Spark throws the following exception:

Cause: java.lang.UnsupportedOperationException: Not supported DataType: org.apache.spark.mllib.linalg.VectorUDT@f71b0bce

How can I use Vector in my UDFs? The Spark version is 1.5.1.

UPD

val dataFrame: DataFrame = sqlContext.createDataFrame(Seq(
  (0, 1, 2),
  (0, 3, 4),
  (0, 5, 6)
)).toDF("key", "a", "b")

val someUdf = udf {
  (a: Double, b: Double) => Vectors.dense(a, b)
}

dataFrame.groupBy(col("key"))
  .agg(someUdf(avg("a"), avg("b")))

Upvotes: 1

Views: 1361

Answers (1)

zero323
zero323

Reputation: 330163

There is nothing wrong with your UDF per se. It looks like you get an exception because you call it inside agg method on aggregate columns. To make it work you can simply push it outside agg step:

dataFrame
  .groupBy($"key")
  .agg(avg($"a").alias("a"), avg($"b").alias("b"))
  .select($"key", someUdf($"a", $"b"))

Upvotes: 1

Related Questions