Reputation: 3257
I am very new to using PySpark. I have a column of SparseVectors in my PySpark dataframe.
rescaledData.select('features').show(5,False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|(262144,[43953,62425,66522,148962,174441,249180],[3.9219733362813143,3.9219733362813143,1.213923135179104,3.9219733362813143,3.9219733362813143,0.5720692490067093])|
|(262144,[57925,66522,90939,249180],[3.5165082281731497,1.213923135179104,3.9219733362813143,0.5720692490067093]) |
|(262144,[23366,45531,73408,211290],[2.6692103677859462,3.005682604407159,3.5165082281731497,3.228826155721369]) |
|(262144,[30913,81939,99546,137643,162885,249180],[3.228826155721369,3.9219733362813143,3.005682604407159,3.005682604407159,3.228826155721369,1.1441384980134186]) |
|(262144,[108134,152329,249180],[3.9219733362813143,2.6692103677859462,2.8603462450335466]) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I need to convert the above dataframe into a matrix where each row of the matrix corresponds to the SparseVector in the same row of the dataframe.
For example,
+-----------------+
|features |
+-----------------+
|(7,[1,2],[45,63])|
|(7,[3,5],[85,69])|
|(7,[1,2],[89,56])|
+-----------------+
must be converted to:
[[0,45,63,0,0,0,0]
[0,0,0,85,0,69,0]
[0,89,56,0,0,0,0]]
I have read the link below, which shows that there is a toArray() function
that does exactly what I want.
https://mingchen0919.github.io/learning-apache-spark/pyspark-vectors.html
However, I am having trouble using it.
from pyspark.sql.functions import udf

vector_udf = udf(lambda vector: vector.toArray())
rescaledData.withColumn('features_', vector_udf(rescaledData.features)).first()
I need it to convert every row into an array and then convert the PySpark dataframe into a matrix.
Upvotes: 4
Views: 11202
Reputation: 5870
toArray() returns a NumPy array, which a udf cannot return directly, so convert it to a Python list and declare the return type. Then collect the dataframe.
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

vector_udf = udf(lambda vector: vector.toArray().tolist(), ArrayType(DoubleType()))
df.show() ## my sample dataframe
+-------------------+
| features|
+-------------------+
|(4,[1,3],[3.0,4.0])|
|(4,[1,3],[3.0,4.0])|
|(4,[1,3],[3.0,4.0])|
+-------------------+
colvalues = df.select(vector_udf('features').alias('features')).collect()  # collect the converted column to the driver
list(map(lambda x: x.features, colvalues))  # pull the plain lists out of the Row objects
[[0.0, 3.0, 0.0, 4.0], [0.0, 3.0, 0.0, 4.0], [0.0, 3.0, 0.0, 4.0]]
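If a NumPy matrix is what you ultimately need, the nested lists convert directly. A minimal sketch, assuming numpy is available on the driver:

import numpy as np

rows = list(map(lambda x: x.features, colvalues))
matrix = np.array(rows)  # shape (3, 4) for the sample dataframe above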
Upvotes: 7
Reputation: 35229
Convert to an RDD and map:
vectors = df.select("features").rdd.map(lambda row: row.features)
Convert the result to a distributed matrix:
from pyspark.mllib.linalg.distributed import RowMatrix
matrix = RowMatrix(vectors)
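A small usage sketch: the distributed matrix can be inspected without collecting it, using the standard RowMatrix accessors:

print(matrix.numRows())  # number of vectors, e.g. 5 if the dataframe holds only the rows shown in the question
print(matrix.numCols())  # vector size, 262144 for the TF-IDF features above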
If you want DenseVectors (memory requirements!):
vectors = df.select("features").rdd.map(lambda row: row.features.toArray())
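To end up with the local dense matrix the question asks for, collect the rows to the driver. A sketch assuming numpy is installed and the data fits in driver memory:

import numpy as np

local_matrix = np.array(vectors.collect())  # one dense row per original SparseVector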
Upvotes: 7