Reputation: 23
I would like to create a new column on a dataframe, which is the result of applying a function to an arraytype column.
Something like this:
df = df.withColumn("max_$colname", max(col(colname)))
where each row of the column holds an array of values?
The functions in spark.sql.function appear to work on a column basis only.
Upvotes: 2
Views: 1532
Reputation: 1892
You can apply user defined functions on the array column.
1.DataFrame
+------------------+
| arr|
+------------------+
| [1, 2, 3, 4, 5]|
|[4, 5, 6, 7, 8, 9]|
+------------------+
2.Creating UDF
import org.apache.spark.sql.functions._
def max(arr: TraversableOnce[Int])=arr.toList.max
val maxUDF=udf(max(_:Traversable[Int]))
3.Applying UDF in query
df.withColumn("arrMax",maxUDF(df("arr"))).show
4.Result
+------------------+------+
| arr|arrMax|
+------------------+------+
| [1, 2, 3, 4, 5]| 5|
|[4, 5, 6, 7, 8, 9]| 9|
+------------------+------+
Upvotes: 1