newleaf

Reputation: 2457

pyspark column value is a list

I am working with a dataframe that contains a column whose values are lists. How can I process such a column?

id    |   values
1     |   [1,1,2,4,3,5,6]
2     |   [1,2,3,5,6,7,8]
....

For each row, take the first three values and get the max of those three.

Expected output:

id  | max_value
1   | 2
2   | 3
....

Upvotes: 0

Views: 746

Answers (1)

A.B

Reputation: 20445

You can use the slice and array_max functions from pyspark.sql.functions.

For example, by passing array_max(slice(values, 1, 3)) to F.expr, you first slice the list (slice) and then take the max (array_max):

import pyspark.sql.functions as F

df \
    .withColumn("max_value", F.expr("array_max(slice(values, 1, 3))")) \
    .show(truncate=False)

+---+---------+
|id |max_value|
+---+---------+
|1  |2        |
|2  |3        |
+---+---------+

Upvotes: 1
