cipri.l

Reputation: 819

How to convert from org.apache.spark.mllib.linalg.SparseVector to org.apache.spark.ml.linalg.SparseVector?

I am converting my code from the mllib API to the ml API.

import org.apache.spark.mllib.linalg.{DenseVector, Vector}
import org.apache.spark.ml.linalg.{DenseVector => NewDenseVector, Vector => NewVector}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.{LabeledPoint => NewLabeledPoint}

val labelPointData = limitedTable.rdd.map { row =>
  new NewLabeledPoint(convertToDouble(row.head), row(1).asInstanceOf[org.apache.spark.ml.linalg.SparseVector])
}

The statement row(1).asInstanceOf[org.apache.spark.ml.linalg.SparseVector] fails with the following exception:

org.apache.spark.mllib.linalg.SparseVector cannot be cast to org.apache.spark.ml.linalg.SparseVector

How to overcome that?

I have found code for converting from mllib to ml, but not vice versa.

Upvotes: 4

Views: 3208

Answers (2)

DennisLi

Reputation: 4156

In pyspark, you can convert between the different vector types like this:

import numpy as np  # needed for np.nonzero below

from pyspark.mllib.linalg import Vectors as mllib_vectors
from pyspark.ml.linalg import Vectors as ml_vectors

# Construct dense vectors in mllib and ml
v1 = mllib_vectors.dense([1.0, 1.0, 0, 0, 0])
v2 = ml_vectors.dense([1.0, 1.0, 0, 0, 0])
print('v1: %s' % v1)
print('v2: %s' % v2)
print(v1 == v2)
print(type(v1), type(v2))

# Convert vector to numpy array
arr1 = v1.toArray()
print('arr1: %s type: %s' % (arr1, type(arr1)))

# convert mllib vectors to ml vectors
v3 = ml_vectors.dense(arr1)
print('v3: %s' % v3)
print(type(v3))


# Convert ml dense vector to sparse vector
arr2 = v2.toArray()
print('arr2', arr2)
d = {i:arr2[i] for i in np.nonzero(arr2)[0]}
print('d', d)

v4 = ml_vectors.sparse(len(arr2), d)
print('v4: %s' % v4)


# Convert ml sparse vector to dense vector
v5 = ml_vectors.dense(v4.toArray())
print('v5: %s' % v5)


# Convert mllib dense vector to an ml sparse vector (via its numpy array)
d1 = {i: arr1[i] for i in np.nonzero(arr1)[0]}
v6 = ml_vectors.sparse(len(arr1), d1)
print('v6: %s' % v6)


# Convert ml sparse vector to mllib sparse vector
arr3 = v4.toArray()
d = {i:arr3[i] for i in np.nonzero(arr3)[0]}
v7 = mllib_vectors.sparse(len(arr3), d)
print('v7: %s' % v7)

The output is:

v1: [1.0,1.0,0.0,0.0,0.0]
v2: [1.0,1.0,0.0,0.0,0.0]
False
<class 'pyspark.mllib.linalg.DenseVector'> <class 'pyspark.ml.linalg.DenseVector'>
arr1: [1. 1. 0. 0. 0.] type: <class 'numpy.ndarray'>
v3: [1.0,1.0,0.0,0.0,0.0]
<class 'pyspark.ml.linalg.DenseVector'>
arr2 [1. 1. 0. 0. 0.]
d {0: 1.0, 1: 1.0}
v4: (5,[0,1],[1.0,1.0])
v5: [1.0,1.0,0.0,0.0,0.0]
v6: (5,[0,1],[1.0,1.0])
v7: (5,[0,1],[1.0,1.0])

Upvotes: 0

Shaido

Reputation: 28322

It is possible to convert in both directions. First, let's create an mllib SparseVector:

import org.apache.spark.mllib.linalg.Vectors
val mllibVec: org.apache.spark.mllib.linalg.Vector = Vectors.sparse(3, Array(0, 1, 2), Array(1.0, 2.0, 3.0))

To convert to ML SparseVector, simply use asML:

val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML

To convert it back again, the easiest way is to use Vectors.fromML():

val mllibVec2: org.apache.spark.mllib.linalg.Vector = Vectors.fromML(mlVec)

In addition, in your code you can use row.getAs[SparseVector](1) instead of row(1).asInstanceOf[SparseVector]: read the vector as an mllib vector, convert it with asML, and pass it to the ml-based LabeledPoint, i.e.:

val labelPointData = limitedTable.rdd.map { row =>
  NewLabeledPoint(convertToDouble(row.head), row.getAs[org.apache.spark.mllib.linalg.SparseVector](1).asML)
}

Upvotes: 10
