Reputation: 193
I used the following code:
df.withColumn("dense_vector", $"sparse_vector".toDense)
but it gives an error.
I am new to Spark, so this might be obvious and there may be obvious errors in my code line. Please help. Thank you!
Upvotes: 6
Views: 9309
Reputation: 330443
Contexts that require an operation like this are relatively rare in Spark. With one or two exceptions, the Spark API expects the common Vector class, not a specific implementation (SparseVector, DenseVector). This is also true for the distributed structures from o.a.s.mllib.linalg.distributed:
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// toDF requires the SQL implicits to be in scope
import sqlContext.implicits._

val df = Seq[(Long, Vector)](
  (1L, Vectors.dense(1, 2, 3)), (2L, Vectors.sparse(3, Array(1), Array(3)))
).toDF("id", "v")

new RowMatrix(df.select("v")
    .map(_.getAs[Vector]("v")))
  .columnSimilarities(0.9)
  .entries
  .first
// apache.spark.mllib.linalg.distributed.MatrixEntry = MatrixEntry(0,2,1.0)
Nevertheless, you could use a UDF like this:
import org.apache.spark.sql.functions.udf

val asDense = udf((v: Vector) => v.toDense)
df.withColumn("vd", asDense($"v")).show
// +---+-------------+-------------+
// | id| v| vd|
// +---+-------------+-------------+
// | 1|[1.0,2.0,3.0]|[1.0,2.0,3.0]|
// | 2|(3,[1],[3.0])|[0.0,3.0,0.0]|
// +---+-------------+-------------+
Just keep in mind that since version 2.0 Spark provides two different and incompatible Vector types:
o.a.s.ml.linalg.Vector
o.a.s.mllib.linalg.Vector
each with a corresponding SQL UDT. See MatchError while accessing vector column in Spark 2.0
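With the newer DataFrame-based API, the same UDF approach works using the o.a.s.ml.linalg types instead. A minimal sketch (assumes Spark 2.x with a SparkSession named spark in scope, as in spark-shell):

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// toDF and $ require the session implicits
import spark.implicits._

val df = Seq(
  (1L, Vectors.dense(1, 2, 3)),
  (2L, Vectors.sparse(3, Array(1), Array(3.0)))
).toDF("id", "v")

// toDense is defined the same way on o.a.s.ml.linalg.Vector
val asDense = udf((v: Vector) => v.toDense)

df.withColumn("vd", asDense($"v")).show
```

Mixing the two Vector types in one pipeline raises the MatchError mentioned above, so pick the package that matches the API (ml for DataFrames, mllib for RDDs) and convert explicitly if you must cross over.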
Upvotes: 8