Reputation: 17676
I have a Spark DataFrame like:
+-------------+------------------------------------------+
|a |destination |
+-------------+------------------------------------------+
|[a,Alice,1] |[[b,Bob,0], [e,Esther,0], [h,Fraudster,1]]|
|[e,Esther,0] |[[f,Fanny,0], [d,David,0]] |
|[c,Charlie,0]|[[b,Bob,0]] |
|[b,Bob,0] |[[c,Charlie,0]] |
|[f,Fanny,0] |[[c,Charlie,0], [h,Fraudster,1]] |
|[d,David,0] |[[a,Alice,1], [e,Esther,0]] |
+-------------+------------------------------------------+
with a schema of
|-- destination: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- var_only_0_and_1: integer (nullable = false)
How can I construct a UDF which operates on the column destination, i.e. the wrapped array created by Spark's collect_list, to calculate the mean of the variable var_only_0_and_1?
Upvotes: 2
Views: 3003
Reputation: 758
You can operate directly on the array as long as you get the method signature of the UDF correct (something that has hit me hard in the past). Array columns are visible to a UDF as a Seq, and a struct as a Row, so you'll need something like this:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

def test(in: Seq[Row]): Int = {
  // return a named field from the third struct in the array (index 2);
  // the field is an integer in the schema, so use getAs[Int]
  in(2).getAs[Int]("var_only_0_and_1")
}
val udfTest = udf(test _)
I've tested this on data looking like yours. I'm guessing it's possible to iterate over the elements of the Seq[Row] in order to achieve what you want, along the lines of the sketch below.
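For instance, here is a minimal sketch of that iteration, assuming the field is an Int as in the schema above (the meanVar name and the empty-array guard are my own additions, not from the question):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Average var_only_0_and_1 across all structs in the array;
// returns 0.0 for an empty array to avoid division by zero.
val meanVar = udf { (in: Seq[Row]) =>
  if (in.isEmpty) 0.0
  else in.map(_.getAs[Int]("var_only_0_and_1")).sum.toDouble / in.size
}

df.withColumn("mean_var", meanVar(col("destination")))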
To be honest, I'm not at all sure about the type safety of doing this, and I believe that explode is the preferable way to do it, as per @ayplam. Built-in functions will generally be faster than any UDF a dev provides, as Spark cannot optimise a UDF.
Upvotes: 5
Reputation: 1953
You can use native Spark SQL functions for this:
import org.apache.spark.sql.functions.{avg, col, explode}

df.withColumn("dest", explode(col("destination")))
  .groupBy("a").agg(avg(col("dest").getField("var_only_0_and_1")))
Upvotes: 0