Reputation: 475
I have a sparse vector column obtained through OneHotEncoder in a Spark DataFrame. The first 10 rows look like this:
+------------------------------------+
|check_indexed_encoded |
+------------------------------------+
| (3,[2],[1.0])|
| (3,[0],[1.0])|
| (3,[2],[1.0])|
| (3,[2],[1.0])|
| (3,[2],[1.0])|
| (3,[2],[1.0])|
| (3,[2],[1.0])|
| (3,[2],[1.0])|
| (3,[2],[1.0])|
| (3,[0],[1.0])|
+------------------------------------+
only showing top 10 rows
I am trying to access these elements in order to convert the column back into ordinary one-hot encoded dummy columns, so that the entire frame can be converted into Pandas without issues. Within Spark I tried using .GetItem and .element, but this throws the error message "Can't extract value: need struct type". Any ideas how to get the values out of the vector? Thanks!
Upvotes: 4
Views: 1577
Reputation: 284
You could use a UDF. This should do it:
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DoubleType
# Convert the ML vector into a plain array of doubles so that getItem works
vector_udf = F.udf(lambda vector: vector.toArray().tolist(), ArrayType(DoubleType()))
# Extract the first element of the vector into its own column
df = df.withColumn("check_indexed_encoded_0", vector_udf(df["check_indexed_encoded"]).getItem(0))
To access the second element use getItem(1), and so on.
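To make it clearer what the UDF produces for each row, here is a small pure-Python sketch (no Spark needed) of how a value displayed as (3,[2],[1.0]) expands into a dense list, which is roughly what toArray().tolist() returns. The helper name sparse_to_dense is made up for illustration and is not part of pyspark.

```python
def sparse_to_dense(size, indices, values):
    """Expand a (size, indices, values) sparse triple into a plain list of floats.

    Hypothetical helper for illustration only; it mimics the dense form
    that a sparse vector's toArray().tolist() would give.
    """
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = float(v)
    return dense


# Two of the rows shown in the question: (3,[2],[1.0]) and (3,[0],[1.0])
row_a = sparse_to_dense(3, [2], [1.0])
row_b = sparse_to_dense(3, [0], [1.0])

print(row_a)  # [0.0, 0.0, 1.0]
print(row_b)  # [1.0, 0.0, 0.0]

# Indexing the dense list corresponds to getItem(0), getItem(1), getItem(2)
print(row_a[0], row_a[1], row_a[2])
```

Each position in the dense list becomes one dummy column, so with three positions you would call getItem(0), getItem(1) and getItem(2) to get three columns.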
Upvotes: 1