Reputation: 225
Is there a built in way to create a sparse vector from a dense vector in PySpark? The way I am doing this is the following:
Vectors.sparse(len(denseVector), [(i,j) for i,j in enumerate(denseVector) if j != 0 ])
That satisfies the [size, (index, data)] format. Seems kinda hacky. Is there a more efficient way to do it?
Upvotes: 2
Views: 5215
Reputation: 4154
I'm not sure whether you're using mllib or ml. Anyway, You can convert like this:
from pyspark.mllib.linalg import Vectors as mllib_vectors
from pyspark.ml.linalg import Vectors as ml_vectors
# Construct dense vectors in mllib and ml
v1 = mllib_vectors.dense([1.0, 1.0, 0, 0, 0])
v2 = ml_vectors.dense([1.0, 1.0, 0, 0, 0])
# Convert ml dense vector to sparse vector
arr2 = v2.toArray()
print('arr2', arr2)
d = {i:arr2[i] for i in np.nonzero(arr2)[0]}
print('d', d)
v4 = ml_vectors.sparse(len(arr2), d)
print('v4: %s' % v4)
# Convert mllib dense vector to sparse vector
v6 = ml_vectors.sparse(len(arr2), d)
print('v6: %s' % v6)
Upvotes: 1
Reputation: 425
import scipy.sparse
from pyspark.ml.linalg import Vectors, _convert_to_vector, VectorUDT
from pyspark.sql.functions import udf, col
If you have just one dense vector this will do it:
def dense_to_sparse(vector):
return _convert_to_vector(scipy.sparse.csc_matrix(vector.toArray()).T)
dense_to_sparse(densevector)
The trick here is that csc_matrix.shape[1] has to equal 1, so transpose the vector. Have a look at the source of _convert_to_vector: https://people.eecs.berkeley.edu/~jegonzal/pyspark/_modules/pyspark/mllib/linalg.html
The more likely scenario is you have a DF with a column of densevectors:
to_sparse = udf(dense_to_sparse, VectorUDT())
DF.withColumn("sparse", to_sparse(col("densevector"))
Upvotes: 7