Reputation: 17676
I have a Spark DataFrame like:
+-------------+------------------------------------------+
|a |destination |
+-------------+------------------------------------------+
|[a,Alice,1] |[[b,Bob,0], [e,Esther,0], [h,Fraudster,1]]|
|[e,Esther,0] |[[f,Fanny,0], [d,David,0]] |
|[c,Charlie,0]|[[b,Bob,0]] |
|[b,Bob,0] |[[c,Charlie,0]] |
|[f,Fanny,0] |[[c,Charlie,0], [h,Fraudster,1]] |
|[d,David,0] |[[a,Alice,1], [e,Esther,0]] |
+-------------+------------------------------------------+
with a schema of
|-- destination: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- var_only_0_and_1: integer (nullable = false)
How can I construct a UDF which operates on the column destination, i.e. the wrapped array created by Spark's collect_list, to calculate the mean of the variable var_only_0_and_1?
Upvotes: 2
Views: 3003
Reputation: 758
You can operate directly on the array as long as you get the method signature of the UDF correct (something that has hit me hard in the past). Array columns are visible to a UDF as a Seq, and a struct as a Row, so you'll need something like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

def test(in: Seq[Row]): Int = {
  // return a named field from the third struct in the array (index 2);
  // the field is an integer in the schema, so use getAs[Int]
  in(2).getAs[Int]("var_only_0_and_1")
}
val udfTest = udf(test _)
I've tested this on data looking like yours. I'm guessing it's possible to iterate over the elements of the Seq[Row] in order to achieve what you want, along the lines of the sketch below.
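For instance, here is a minimal sketch of that iteration, assuming the field is an Int as in the schema above (the meanVar name and the empty-array guard are my own additions, not from the question):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Average var_only_0_and_1 across all structs in the array;
// returns 0.0 for an empty array to avoid division by zero.
val meanVar = udf { (in: Seq[Row]) =>
  if (in.isEmpty) 0.0
  else in.map(_.getAs[Int]("var_only_0_and_1")).sum.toDouble / in.size
}

df.withColumn("mean_var", meanVar(col("destination")))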
To be honest, I'm not at all sure about the type safety of doing this, and I believe that explode is the preferable way to do it, as per @ayplam. Built-in functions will generally be faster than any UDF a dev provides, as Spark cannot optimise a UDF.
Upvotes: 5
Reputation: 1953
You can use native Spark SQL functions for this:
import org.apache.spark.sql.functions.{avg, col, explode}

df.withColumn("dest", explode(col("destination")))
  .groupBy("a").agg(avg(col("dest").getField("var_only_0_and_1")))
Upvotes: 0