How to access spark sparse vector element

Question

I have a sparse vector column in a Spark DataFrame, produced by OneHotEncoder. The first 10 rows look like this:

+------------------------------------+
|check_indexed_encoded               |
+------------------------------------+
|                       (3,[2],[1.0])|
|                       (3,[0],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[2],[1.0])|
|                       (3,[0],[1.0])|
+------------------------------------+
only showing top 10 rows

I am trying to access these elements so I can convert the column back into ordinary one-hot encoded dummy columns, and then convert the entire DataFrame to Pandas without issues. Within Spark I tried using .GetItem and .element, but that also fails with the error message "Can't extract value: need struct type". Any ideas how to get the values out of the vector? Thanks!

Majte · Accepted Answer

You could use a UDF. This should do it:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType

# Convert the ML vector to a plain list of floats so that getItem() works on it
vector_udf = F.udf(lambda vector: vector.toArray().tolist(), ArrayType(DoubleType()))
# Extract the first element of each vector into its own column
df = df.withColumn("check_indexed_encoded_0", vector_udf(df["check_indexed_encoded"]).getItem(0))

To access the second element, use getItem(1), and so on.
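
To make clear what the UDF returns for each row, here is a minimal pure-Python sketch of the sparse-to-dense expansion. The displayed value (3,[2],[1.0]) reads as (size, indices, values). The helper name sparse_to_dense is hypothetical and only for illustration; it is not part of Spark, and it only mimics the result of vector.toArray().tolist().

```python
from typing import List, Sequence


def sparse_to_dense(size: int, indices: Sequence[int], values: Sequence[float]) -> List[float]:
    """Expand a (size, indices, values) sparse triple into a dense list of floats.

    Hypothetical helper for illustration; it is not a Spark API.
    """
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    dense = [0.0] * size            # every position not listed in indices is 0.0
    for i, v in zip(indices, values):
        dense[i] = float(v)         # place each stored value at its index
    return dense


# The two distinct rows shown in the question
row_a = sparse_to_dense(3, [2], [1.0])
row_b = sparse_to_dense(3, [0], [1.0])
print(row_a)  # [0.0, 0.0, 1.0]
print(row_b)  # [1.0, 0.0, 0.0]
```

So getItem(0) on the UDF output gives 0.0 for the first kind of row and 1.0 for the second. If Spark 3.0 or later is available, pyspark.ml.functions.vector_to_array performs the same conversion without a Python UDF.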
