Reputation: 43
I started converting my Pandas implementations to PySpark, but I'm having trouble with some basic operations. So I have this table:
+-----+-----+----+
| Col1|Col2 |Col3|
+-----+-----+----+
|    1|[1,3]|   0|
|   44|[2,0]|   1|
|   77|[1,5]|   7|
+-----+-----+----+
My desired output is:
+-----+-----+----+----+
| Col1|Col2 |Col3|Col4|
+-----+-----+----+----+
|    1|[1,3]|   0|2.67|
|   44|[2,0]|   1|2.67|
|   77|[1,5]|   7|2.67|
+-----+-----+----+----+
To get here:
Upvotes: 1
Views: 148
Reputation: 42352
You can use greatest to get the greatest average of each (sub-)column in the array. For your sample, the per-position averages are (1+2+1)/3 ≈ 1.33 and (3+0+5)/3 ≈ 2.67, so the greatest is 2.67:
from pyspark.sql import functions as F, Window

# Col2 holds ML vectors, so convert them to array<double> first
to_array = F.udf(lambda v: [float(x) for x in v.toArray()], 'array<double>')

df2 = df.withColumn(
    'Col4',
    # average of each vector position over the whole frame, then take the greatest
    F.greatest(*[F.avg(to_array('Col2')[i]).over(Window.orderBy()) for i in range(2)])
)
df2.show()
+----+------+----+------------------+
|Col1|  Col2|Col3|              Col4|
+----+------+----+------------------+
|   1|[1, 3]|   0|2.6666666666666665|
|  44|[2, 0]|   1|2.6666666666666665|
|  77|[1, 5]|   7|2.6666666666666665|
+----+------+----+------------------+
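For reference, the input frame can be reproduced with something like the following (a minimal sketch, assuming Col2 holds Spark ML DenseVectors rather than plain arrays, which is why the UDF calls .toArray()):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# sample data matching the question's table; Col2 assumed to be a DenseVector
df = spark.createDataFrame(
    [(1, Vectors.dense([1, 3]), 0),
     (44, Vectors.dense([2, 0]), 1),
     (77, Vectors.dense([1, 5]), 7)],
    ['Col1', 'Col2', 'Col3'],
)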
If you want the array size to be dynamic, you can compute it first:
# determine the longest vector in Col2 and use its length as the range
arr_size = df.select(F.max(F.size(to_array('Col2')))).head()[0]

df2 = df.withColumn(
    'Col4',
    F.greatest(*[F.avg(to_array('Col2')[i]).over(Window.orderBy()) for i in range(arr_size)])
)
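As a side note, if you are on Spark 3.0+ you could likely replace the Python UDF with the built-in pyspark.ml.functions.vector_to_array. A sketch of the same idea under that assumption (not tested against your data):

from pyspark.ml.functions import vector_to_array

# built-in vector-to-array conversion instead of the UDF
arr = vector_to_array(F.col('Col2'))
arr_size = df.select(F.max(F.size(arr))).head()[0]
df2 = df.withColumn(
    'Col4',
    F.greatest(*[F.avg(arr[i]).over(Window.orderBy()) for i in range(arr_size)])
)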
Upvotes: 1