Leothorn

Reputation: 1345

Size of the sparse vector in the column of a DataFrame in Apache Spark (Scala)

I am using a VectorAssembler to transform a DataFrame.

var stringAssembler = new VectorAssembler().setInputCols(encodedstringColumns).setOutputCol("stringFeatures")
df = stringAssembler.transform(df)
var stringVectorSize = df.select("stringFeatures").head.size  // <-- this is the problematic line: it returns 1
var stringPca = new PCA().setInputCol("stringFeatures").setOutputCol("pcaStringFeatures").setK(stringVectorSize).fit(df)

Now stringVectorSize tells PCA how many components to keep. I am trying to get the size of the sparse vector output by the VectorAssembler, but my code gives size = 1, which is wrong. What is the right way to get the size of a sparse vector stored in a DataFrame column?

To put it plainly

+-------------+------------+-------------+------------+---+-----------+---------------+-----------------+--------------------+
|PetalLengthCm|PetalWidthCm|SepalLengthCm|SepalWidthCm| Id|    Species|Species_Encoded|       Id_Encoded|      stringFeatures|
+-------------+------------+-------------+------------+---+-----------+---------------+-----------------+--------------------+
|          1.4|         0.2|          5.1|         3.5|  1|Iris-setosa|  (2,[0],[1.0])| (149,[91],[1.0])|(151,[91,149],[1....|
|          1.4|         0.2|          4.9|         3.0|  2|Iris-setosa|  (2,[0],[1.0])|(149,[119],[1.0])|(151,[119,149],[1...|
|          1.3|         0.2|          4.7|         3.2|  3|Iris-setosa|  (2,[0],[1.0])|(149,[140],[1.0])|(151,[140,149],[1...|

For the above DataFrame, I want to extract the size of the stringFeatures sparse vector (which is 151).
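As an aside on what that 151 means: a sparse vector's size is its declared dimensionality, not the number of entries it actually stores. A minimal stand-in sketch (MiniSparseVector is an illustrative name, not Spark's class) mirroring the first stringFeatures cell above:

```scala
// Minimal stand-in for Spark's SparseVector, for illustration only:
// `size` is the declared dimensionality, while `indices`/`values`
// hold just the non-zero entries.
case class MiniSparseVector(size: Int, indices: Array[Int], values: Array[Double]) {
  def numNonzeros: Int = values.count(_ != 0.0)
}

// Mirrors the first stringFeatures cell above: (151,[91,149],[1.0,1.0])
val v = MiniSparseVector(151, Array(91, 149), Array(1.0, 1.0))
println(v.size)        // 151 -- the dimensionality PCA needs
println(v.numNonzeros) // 2   -- only two entries are actually stored
```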

Upvotes: 2

Views: 7503

Answers (1)

Alberto Bonsanto

Reputation: 18042

If you read the DataFrame documentation, you will notice that the head method returns a Row. Therefore, rather than obtaining your SparseVector's size, you are obtaining the Row's size (the number of columns in it, which is 1 here). To solve this, you have to extract the element stored in the Row.

val row = df.select("stringFeatures").head
val vector = row(0).asInstanceOf[SparseVector]
val size = vector.size

For instance:

import sqlContext.implicits._
import org.apache.spark.mllib.linalg.SparseVector

val df = sc.parallelize(Array(10,2,3,4)).toDF("n")
val pepe = udf((i: Int) => new SparseVector(i, Array(i-1), Array(i)))
val x = df.select(pepe(df("n")).as("n"))

x.show()

+---------------+
|              n|
+---------------+
|(10,[9],[10.0])|
|  (2,[1],[2.0])|
|  (3,[2],[3.0])|
|  (4,[3],[4.0])|
+---------------+

val y = x.select("n").head

y(0).asInstanceOf[SparseVector].size
res12: Int = 10
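One caveat worth keeping in mind: the cast above only succeeds when the cell actually holds a SparseVector; columns produced by other transformers may contain dense vectors. Spark's linalg vectors share a common Vector trait that also exposes size, so casting to the trait is the safer pattern when you are not sure of the representation. A minimal stand-in sketch (MiniVector, MiniDense, MiniSparse are illustrative names, not Spark's API) of why the trait-level cast works for both:

```scala
// Illustrative stand-ins for Spark's vector hierarchy: both concrete
// representations expose `size` through a shared trait.
sealed trait MiniVector { def size: Int }
case class MiniDense(values: Array[Double]) extends MiniVector {
  def size: Int = values.length
}
case class MiniSparse(size: Int, indices: Array[Int], values: Array[Double])
  extends MiniVector

// A cast to the trait works regardless of which representation is inside.
val cells: Seq[Any] = Seq(
  MiniSparse(151, Array(91, 149), Array(1.0, 1.0)),
  MiniDense(Array(1.4, 0.2, 5.1, 3.5))
)
val sizes = cells.map(_.asInstanceOf[MiniVector].size)
println(sizes) // List(151, 4)
```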

Upvotes: 3
