Reputation: 89
I have a PySpark DataFrame with four columns, one of which holds a dense vector.
| cust_id | label | prediction | probability |
|---|---|---|---|
| 1 | 0 | 0 | {"vectorType":"dense","length":2,"values":[0.5745528913772013,0.4254471086227987]} |
| 2 | 0 | 0 | {"vectorType":"dense","length":2,"values":[0.5185219003114524,0.4814780996885476]} |
| 3 | 0 | 1 | {"vectorType":"dense","length":2,"values":[0.37871114732242217,0.6212888526775778]} |
| 4 | 0 | 1 | {"vectorType":"dense","length":2,"values":[0.4352110724347864,0.5647889275652135]} |
| 5 | 1 | 1 | {"vectorType":"dense","length":2,"values":[0.49476519185173606,0.505234808148264]} |
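For reference, a minimal sketch that reproduces this DataFrame (assuming an active SparkSession bound to `spark`; `selected` is the variable name used in the code further down):

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

# Toy rows mirroring the table above; probability holds an ML dense vector
selected = spark.createDataFrame(
    [
        (1, 0, 0, Vectors.dense([0.5745528913772013, 0.4254471086227987])),
        (2, 0, 0, Vectors.dense([0.5185219003114524, 0.4814780996885476])),
        (3, 0, 1, Vectors.dense([0.37871114732242217, 0.6212888526775778])),
        (4, 0, 1, Vectors.dense([0.4352110724347864, 0.5647889275652135])),
        (5, 1, 1, Vectors.dense([0.49476519185173606, 0.505234808148264])),
    ],
    ["cust_id", "label", "prediction", "probability"],
)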
I want to convert the dense vector to columns and store the output along with the remaining columns.
| cust_id | label | prediction | split_int[0] | split_int[1] |
|---|---|---|---|---|
| 1 | 0 | 0 | 0.574552891 | 0.425447109 |
| 2 | 0 | 0 | 0.5185219 | 0.4814781 |
| 3 | 0 | 1 | 0.378711147 | 0.621288853 |
| 4 | 0 | 1 | 0.435211072 | 0.564788928 |
| 5 | 1 | 1 | 0.494765192 | 0.505234808 |
I found some code online and was able to split the dense vector.
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType

def split_array_to_list(col):
    # Convert the ML dense vector into a plain Python list of doubles
    def to_list(v):
        return v.toArray().tolist()
    return F.udf(to_list, ArrayType(DoubleType()))(col)

df3 = selected.select(split_array_to_list(F.col("probability")).alias("split_int")) \
    .select([F.col("split_int")[i] for i in range(2)])
df3.show()
How can I keep the other columns as well? I tried this, but I get TypeError: 'Column' object is not callable:
df3 = selected.select(F.col("cust_id") + ((split_array_to_list(F.col("probability")).alias("split_int")).select([F.col("split_int")[i] for i in range(2)])))
Upvotes: 1
Views: 1154
Reputation: 2939
Try `withColumn` when using your UDF. In your attempt, `.select` is chained onto a `Column` (the result of `alias`) rather than onto a DataFrame, which is what raises the `TypeError`. Attach the array as a new column first, then select all existing columns plus the individual elements:

df3 = selected.withColumn("split_int", split_array_to_list(F.col("probability"))) \
    .select(F.col("*"), *[F.col("split_int")[i] for i in range(2)])
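If you are on Spark 3.0 or later, the built-in `vector_to_array` can do the conversion without a Python UDF (a minimal sketch, assuming the same `selected` DataFrame as above):

from pyspark.ml.functions import vector_to_array

# vector_to_array turns the ML vector column into a native array<double> column
df3 = selected.withColumn("split_int", vector_to_array(F.col("probability"))) \
    .select(F.col("*"), *[F.col("split_int")[i] for i in range(2)])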
Upvotes: 2