Reputation: 441
I am working on a function that takes 4 inputs.
To do so, I would like to get a list summarizing these 4 elements. However, two of my columns hold a single value per row and the other two hold lists. I can zip the two lists with arrays_zip, but I can't get an array of lists containing all 4 elements:
+----+----+---------+---------+
| l1 | l2 | l3 | l4 |
+----+----+---------+---------+
| 1 | 5 | [1,2,3] | [2,2,2] |
| 2 | 9 | [8,2,7] | [1,7,7] |
| 3 | 3 | [8,4,9] | [5,1,3] |
| 4 | 1 | [5,5,3] | [8,4,3] |
+----+----+---------+---------+
What I want to get:
+----+----+---------+---------+------------------------------------------+
| l1 | l2 | l3 | l4 | l5 |
+----+----+---------+---------+------------------------------------------+
| 1 | 5 | [1,2,3] | [2,2,2] | [[1, 5, 1, 2],[1, 5, 2, 2],[1, 5, 3, 2]] |
| 2 | 9 | [8,2,7] | [1,7,7] | [[2, 9, 8, 1],[2, 9, 2, 7],[2, 9, 7, 7]] |
| 3 | 3 | [8,4,9] | [5,1,3] | [[3, 3, 8, 5],[3, 3, 4, 1],[3, 3, 9, 3]] |
| 4 | 1 | [5,5,3] | [8,4,3] | [[4, 1, 5, 8],[4, 1, 5, 4],[4, 1, 3, 3]] |
+----+----+---------+---------+------------------------------------------+
My idea was to turn l1 and l2 into lists of the same size as l3 and then apply arrays_zip. I didn't find a consistent way to create these lists.
Once I have this list of lists, I would apply a function as follows:
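import pyspark.sql.functions as f
import pyspark.sql.types as T

def is_good(data):
    a, b, c, d = data
    return float(a + b + c + d)  # float matches the declared FloatType

# One number per inner list, hence FloatType rather than ArrayType(FloatType())
is_good_udf = f.udf(is_good, T.FloatType())
spark.udf.register("is_good_udf", is_good, T.FloatType())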
My guess is to build something like this (thanks to @kafels), where the function is applied to each inner list of each row:
df.withColumn("tot", f.expr("transform(l5, y -> is_good_udf(y))"))
The goal is to obtain a list of results such as [9, 10, 11] for the first row, for instance.
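To make the target unambiguous, here is a minimal plain-Python model of what I am after for a single row. It is only a sketch of the intended semantics, not Spark code; build_l5 is a hypothetical helper name used for illustration.

```python
def build_l5(l1, l2, l3, l4):
    """Pair l3 and l4 element-wise and prepend the two scalar values."""
    return [[l1, l2, x, y] for x, y in zip(l3, l4)]


def is_good(data):
    a, b, c, d = data
    return a + b + c + d


# First row of the example table
l5 = build_l5(1, 5, [1, 2, 3], [2, 2, 2])
tot = [is_good(item) for item in l5]

print(l5)   # the list of 4-element lists for this row
print(tot)  # one result per inner list
```

The Spark version should produce the same l5 and tot values, row by row.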
Upvotes: 2
Views: 594
Reputation: 4069
You can use the expr function and apply TRANSFORM:
import pyspark.sql.functions as f
df = df.withColumn('l5', f.expr("""TRANSFORM(arrays_zip(l3, l4), el -> array(l1, l2, el.l3, el.l4))"""))
# +---+---+---------+---------+------------------------------------------+
# |l1 |l2 |l3 |l4 |l5 |
# +---+---+---------+---------+------------------------------------------+
# |1 |5 |[1, 2, 3]|[2, 2, 2]|[[1, 5, 1, 2], [1, 5, 2, 2], [1, 5, 3, 2]]|
# |2 |9 |[8, 2, 7]|[1, 7, 7]|[[2, 9, 8, 1], [2, 9, 2, 7], [2, 9, 7, 7]]|
# |3 |3 |[8, 4, 9]|[5, 1, 3]|[[3, 3, 8, 5], [3, 3, 4, 1], [3, 3, 9, 3]]|
# |4 |1 |[5, 5, 3]|[8, 4, 3]|[[4, 1, 5, 8], [4, 1, 5, 4], [4, 1, 3, 3]]|
# +---+---+---------+---------+------------------------------------------+
Upvotes: 5