Diogo Ribeiro

Reputation: 43

Compute and compare the average of two columns

I started converting my Pandas implementations to PySpark, but I'm having trouble with some basic operations. I have this table:

+-----+-----+----+
| Col1|Col2 |Col3|
+-----+-----+----+
|  1  |[1,3]|   0|
|  44 |[2,0]|   1|
|  77 |[1,5]|   7|
+-----+-----+----+
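For reference, a minimal sketch to reproduce the sample data, assuming Col2 is a Spark ML dense vector column:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Sample data matching the table above; Col2 is assumed to be a dense vector
df = spark.createDataFrame(
    [(1, Vectors.dense([1, 3]), 0),
     (44, Vectors.dense([2, 0]), 1),
     (77, Vectors.dense([1, 5]), 7)],
    ['Col1', 'Col2', 'Col3'],
)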

My desired output is:

+-----+-----+----+----+
| Col1|Col2 |Col3|Col4|
+-----+-----+----+----+
|  1  |[1,3]|   0|2.67|
|  44 |[2,0]|   1|2.67|
|  77 |[1,5]|   7|2.67|
+-----+-----+----+----+

How do I get there?

Upvotes: 1

Views: 148

Answers (1)

mck

Reputation: 42352

You can use greatest to take the larger of the per-element averages of the array column; here the element-wise averages are 4/3 and 8/3, so Col4 becomes 8/3 ≈ 2.67:

from pyspark.sql import functions as F, Window

# Col2 is a vector column, so convert it to array<double> before indexing
to_array = F.udf(lambda v: [float(x) for x in v.toArray()], 'array<double>')

df2 = df.withColumn(
    'Col4',
    F.greatest(*[F.avg(to_array('Col2')[i]).over(Window.orderBy()) for i in range(2)])
)

df2.show()
+----+------+----+------------------+
|Col1|  Col2|Col3|              Col4|
+----+------+----+------------------+
|   1|[1, 3]|   0|2.6666666666666665|
|  44|[2, 0]|   1|2.6666666666666665|
|  77|[1, 5]|   7|2.6666666666666665|
+----+------+----+------------------+

If you want the number of array elements to be determined dynamically rather than hard-coded, you can do:

# find the largest array length to use as the range bound
arr_size = df.select(F.max(F.size(to_array('Col2')))).head()[0]

df2 = df.withColumn(
    'Col4',
    F.greatest(*[F.avg(to_array('Col2')[i]).over(Window.orderBy()) for i in range(arr_size)])
)
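
As a side note, if you are on Spark 3.0 or later, you could likely avoid the Python UDF altogether with pyspark.ml.functions.vector_to_array; a rough sketch of the same idea:

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F, Window

# built-in vector -> array<double> conversion, no Python UDF round trip
arr = vector_to_array('Col2')

df2 = df.withColumn(
    'Col4',
    F.greatest(*[F.avg(arr[i]).over(Window.orderBy()) for i in range(2)])
)

Staying with native Spark functions avoids the serialization overhead of the UDF.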

Upvotes: 1
