aonghus

Reputation: 23

Spark DataFrame ArrayType columns

I would like to create a new column on a DataFrame which is the result of applying a function to an ArrayType column.

Something like this:

df = df.withColumn(s"max_$colname", max(col(colname)))

where each row of the column holds an array of values.

The functions in org.apache.spark.sql.functions appear to work on whole columns only (e.g. max aggregates across rows), not within each row's array.

Upvotes: 2

Views: 1532

Answers (1)

Manoj Kumar Dhakad

Reputation: 1892

You can apply a user-defined function (UDF) to the array column.

1. DataFrame

+------------------+
|               arr|
+------------------+
|   [1, 2, 3, 4, 5]|
|[4, 5, 6, 7, 8, 9]|
+------------------+
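For context, a minimal sketch of how such a DataFrame could be built (assuming an active SparkSession named spark with its implicits imported):

import spark.implicits._

// each inner Seq becomes one row of an array<int> column named "arr"
val df = Seq(
  Seq(1, 2, 3, 4, 5),
  Seq(4, 5, 6, 7, 8, 9)
).toDF("arr")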

2. Creating the UDF

import org.apache.spark.sql.functions._

// returns the maximum element of a row's array
// (note: this local max shadows org.apache.spark.sql.functions.max)
def max(arr: TraversableOnce[Int]) = arr.toList.max
val maxUDF = udf(max(_: Traversable[Int]))
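Spark passes ArrayType columns to Scala UDFs as a Seq, so an equivalent definition (maxUDF2 is a hypothetical name) could take Seq[Int] directly:

// equivalent UDF written against the Seq that Spark actually passes in
val maxUDF2 = udf((arr: Seq[Int]) => arr.max)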

3. Applying the UDF in a query

df.withColumn("arrMax", maxUDF(df("arr"))).show

4. Result

+------------------+------+
|               arr|arrMax|
+------------------+------+
|   [1, 2, 3, 4, 5]|     5|
|[4, 5, 6, 7, 8, 9]|     9|
+------------------+------+
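To mirror the naming pattern from the question, the same UDF can be applied with an interpolated column name (colname here is a hypothetical variable holding the column's name):

df.withColumn(s"max_$colname", maxUDF(col(colname))).show

If you are on Spark 2.4 or later, the built-in array_max function avoids the UDF entirely:

df.withColumn("arrMax", array_max(df("arr"))).show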

Upvotes: 1
