Reputation: 819
How to convert from org.apache.spark.mllib.linalg.SparseVector to org.apache.spark.ml.linalg.SparseVector?
I am converting code from the mllib API to the ml API.
import org.apache.spark.mllib.linalg.{DenseVector, Vector}
import org.apache.spark.ml.linalg.{DenseVector => NewDenseVector, Vector => NewVector}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.{LabeledPoint => NewLabeledPoint}
val labelPointData = limitedTable.rdd.map { row =>
new NewLabeledPoint(convertToDouble(row.head), row(1).asInstanceOf[org.apache.spark.ml.linalg.SparseVector])
}
The statement row(1).asInstanceOf[org.apache.spark.ml.linalg.SparseVector] fails with the following exception:
org.apache.spark.mllib.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.SparseVector
How can I overcome that? I have found code that converts from mllib to ml, but not vice versa.
Upvotes: 4
Views: 3208
Reputation: 4156
In pyspark, you can convert between the mllib and ml vector types, and between dense and sparse vectors, by going through a numpy array:
import numpy as np
from pyspark.mllib.linalg import Vectors as mllib_vectors
from pyspark.ml.linalg import Vectors as ml_vectors
# Construct dense vectors in mllib and ml
v1 = mllib_vectors.dense([1.0, 1.0, 0, 0, 0])
v2 = ml_vectors.dense([1.0, 1.0, 0, 0, 0])
print('v1: %s' % v1)
print('v2: %s' % v2)
print(v1 == v2)
print(type(v1), type(v2))
# Convert vector to numpy array
arr1 = v1.toArray()
print('arr1: %s type: %s' % (arr1, type(arr1)))
# convert mllib vectors to ml vectors
v3 = ml_vectors.dense(arr1)
print('v3: %s' % v3)
print(type(v3))
# Convert ml dense vector to sparse vector
arr2 = v2.toArray()
print('arr2', arr2)
d = {i:arr2[i] for i in np.nonzero(arr2)[0]}
print('d', d)
v4 = ml_vectors.sparse(len(arr2), d)
print('v4: %s' % v4)
# Convert ml sparse vector to dense vector
v5 = ml_vectors.dense(v4.toArray())
print('v5: %s' % v5)
# Convert mllib dense vector (via arr1) to ml sparse vector
d1 = {i:arr1[i] for i in np.nonzero(arr1)[0]}
v6 = ml_vectors.sparse(len(arr1), d1)
print('v6: %s' % v6)
# Convert ml sparse vector to mllib sparse vector
arr3 = v4.toArray()
d = {i:arr3[i] for i in np.nonzero(arr3)[0]}
v7 = mllib_vectors.sparse(len(arr3), d)
print('v7: %s' % v7)
The output is:
v1: [1.0,1.0,0.0,0.0,0.0]
v2: [1.0,1.0,0.0,0.0,0.0]
False
<class 'pyspark.mllib.linalg.DenseVector'> <class 'pyspark.ml.linalg.DenseVector'>
arr1: [1. 1. 0. 0. 0.] type: <class 'numpy.ndarray'>
v3: [1.0,1.0,0.0,0.0,0.0]
<class 'pyspark.ml.linalg.DenseVector'>
arr2 [1. 1. 0. 0. 0.]
d {0: 1.0, 1: 1.0}
v4: (5,[0,1],[1.0,1.0])
v5: [1.0,1.0,0.0,0.0,0.0]
v6: (5,[0,1],[1.0,1.0])
v7: (5,[0,1],[1.0,1.0])
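The (5,[0,1],[1.0,1.0]) form printed for the sparse vectors is the vector size, then the indices of the non-zero entries, then their values. As a rough illustration of what the dictionary-based conversions above amount to, here is a minimal pure-Python sketch that needs neither Spark nor numpy. The helper names dense_to_sparse and sparse_to_dense are hypothetical and are not part of the pyspark API:

```python
def dense_to_sparse(values):
    """Return (size, indices, non-zero values) for a dense list of numbers."""
    # Keep only the positions whose value is non-zero.
    indices = [i for i, x in enumerate(values) if x != 0]
    return len(values), indices, [float(values[i]) for i in indices]


def sparse_to_dense(size, indices, nonzeros):
    """Rebuild the dense list from a (size, indices, values) triple."""
    dense = [0.0] * size
    for i, x in zip(indices, nonzeros):
        dense[i] = x
    return dense


size, indices, nonzeros = dense_to_sparse([1.0, 1.0, 0, 0, 0])
print(size, indices, nonzeros)                    # 5 [0, 1] [1.0, 1.0]
print(sparse_to_dense(size, indices, nonzeros))   # [1.0, 1.0, 0.0, 0.0, 0.0]
```

This only mirrors the data layout. The actual pyspark vector classes carry extra behaviour, so use the Vectors factory methods shown above when working with real data.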
Upvotes: 0
Reputation: 28322
It is possible to convert in both directions. First, let's create an mllib SparseVector:
import org.apache.spark.mllib.linalg.Vectors
// Indices must be smaller than the vector size (3), and values are Doubles
val mllibVec: org.apache.spark.mllib.linalg.Vector = Vectors.sparse(3, Array(0, 1, 2), Array(1.0, 2.0, 3.0))
To convert to an ML SparseVector, simply use asML:
val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML
To convert it back again, the easiest way is to use Vectors.fromML():
val mllibVec2: org.apache.spark.mllib.linalg.Vector = Vectors.fromML(mlVec)
In addition, in your code, instead of row(1).asInstanceOf[SparseVector] you could try row.getAs[SparseVector](1). Try reading the vector as an mllib vector, then convert it with asML and pass it into the ML-based LabeledPoint, i.e.:
val labelPointData = limitedTable.rdd.map { row =>
  NewLabeledPoint(convertToDouble(row.head), row.getAs[org.apache.spark.mllib.linalg.SparseVector](1).asML)
}
Upvotes: 10