Reputation: 2457
I'm working with a DataFrame that has a column whose values are lists. How can I process such a column?
id | values
1 | [1,1,2,4,3,5,6]
2 | [1,2,3,5,6,7,8]
....
For each row, I want to take the first three values of the list and get the maximum of those three.
Expected output:
id | max_value
1 | 2
2 | 3
....
Upvotes: 0
Views: 746
Reputation: 20445
You can use the slice and array_max functions from pyspark.sql.functions.
For example, by passing array_max(slice(values, 1, 3)) to F.expr, you first slice the list (slice, which is 1-based: start at position 1, take 3 elements) and then take the max of that slice (array_max).
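import pyspark.sql.functions as F

# Take the first three elements of each list, then their max;
# select only the columns shown in the expected output.
(df
 .withColumn("max_value", F.expr("array_max(slice(values, 1, 3))"))
 .select("id", "max_value")
 .show(truncate=False))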
+---+---------+
|id |max_value|
+---+---------+
|1  |2        |
|2  |3        |
+---+---------+
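To make the per-row semantics concrete, here is a plain-Python sketch of what array_max(slice(values, 1, 3)) computes, using the two sample rows from the question. It runs without Spark. The helper name max_of_first is hypothetical, and returning None for an empty list is an assumption meant to mirror Spark returning NULL in that case.

```python
def max_of_first(values, n=3):
    """Plain-Python model of array_max(slice(values, 1, n)) for a single row."""
    # Spark's slice is 1-based, so slice(values, 1, n) corresponds to values[0:n] here.
    head = values[:n]
    # Assumption: an empty slice maps to None, mirroring a NULL result in Spark.
    return max(head) if head else None


# Sample rows copied from the question: (id, values)
rows = [
    (1, [1, 1, 2, 4, 3, 5, 6]),
    (2, [1, 2, 3, 5, 6, 7, 8]),
]

result = [(row_id, max_of_first(values)) for row_id, values in rows]
print(result)  # [(1, 2), (2, 3)]
```

Lists shorter than three elements simply use whatever elements are present, which matches how the slice is modeled here.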
Upvotes: 1