appdev1014

Reputation: 89

PySpark: Convert dense vector to columns

I have a data frame with four columns, one of which is a dense vector.

cust_id  label  prediction  probability
1        0      0           {"vectorType":"dense","length":2,"values":[0.5745528913772013,0.4254471086227987]}
2        0      0           {"vectorType":"dense","length":2,"values":[0.5185219003114524,0.4814780996885476]}
3        0      1           {"vectorType":"dense","length":2,"values":[0.37871114732242217,0.6212888526775778]}
4        0      1           {"vectorType":"dense","length":2,"values":[0.4352110724347864,0.5647889275652135]}
5        1      1           {"vectorType":"dense","length":2,"values":[0.49476519185173606,0.505234808148264]}

I want to split the dense vector into separate columns and keep the remaining columns alongside them.

cust_id  label  prediction  split_int[0]  split_int[1]
1        0      0           0.574552891   0.425447109
2        0      0           0.5185219     0.4814781
3        0      1           0.378711147   0.621288853
4        0      1           0.435211072   0.564788928
5        1      1           0.494765192   0.505234808

I found some code online and was able to split the dense vector.

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType

# UDF that turns an ML dense vector into a plain Python list of doubles
def split_array_to_list(col):
    def to_list(v):
        return v.toArray().tolist()
    return F.udf(to_list, ArrayType(DoubleType()))(col)

# Convert the vector to an array column, then select its elements
df3 = (
    selected
    .select(split_array_to_list(F.col("probability")).alias("split_int"))
    .select([F.col("split_int")[i] for i in range(2)])
)
df3.show()
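
For reference, selected can be reproduced for testing with a minimal sketch like this (assuming an active SparkSession named spark; the rows mirror the first two shown above):

from pyspark.ml.linalg import Vectors

# Hypothetical stand-in for the real 'selected' DataFrame shown above
selected = spark.createDataFrame(
    [
        (1, 0, 0, Vectors.dense([0.5745528913772013, 0.4254471086227987])),
        (2, 0, 0, Vectors.dense([0.5185219003114524, 0.4814780996885476])),
    ],
    ["cust_id", "label", "prediction", "probability"],
)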

How can I also keep the other columns? I tried the following, but I get TypeError: 'Column' object is not callable:

df3 = selected.select(F.col("cust_id") + ((split_array_to_list(F.col("probability")).alias("split_int")).select([F.col("split_int")[i] for i in range(2)])))

Upvotes: 1

Views: 1154

Answers (1)

AdibP

Reputation: 2939

Try withColumn when using your UDF. The TypeError comes from calling .select on a Column expression; .select is a DataFrame method, so attach the array as a column first and then select everything plus its elements:

df3 = (
    selected
    .withColumn("split_int", split_array_to_list(F.col("probability")))
    .select(F.col("*"), *[F.col("split_int")[i] for i in range(2)])
)
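
As an aside, on Spark 3.0+ you can skip the Python UDF entirely: pyspark.ml.functions.vector_to_array converts the ML vector to a native array column. A minimal sketch along the same lines (assuming the same selected DataFrame):

from pyspark.ml.functions import vector_to_array
import pyspark.sql.functions as F

# Convert the ML vector into a native Spark array column (no Python UDF needed),
# then select every original column plus the array elements
df3 = (
    selected
    .withColumn("split_int", vector_to_array(F.col("probability")))
    .select(F.col("*"), *[F.col("split_int")[i] for i in range(2)])
)
df3.show()

Because vector_to_array is evaluated on the JVM rather than in a Python worker, it avoids the serialization overhead of a UDF.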

Upvotes: 2
