Reputation: 1345
I am using a vector assembler to transform a dataframe.
var stringAssembler = new VectorAssembler().setInputCols(encodedstringColumns).setOutputCol("stringFeatures")
df = stringAssembler.transform(df)
**var stringVectorSize = df.select("stringFeatures").head.size**
var stringPca = new PCA().setInputCol("stringFeatures").setOutputCol("pcaStringFeatures").setK(stringVectorSize).fit(output)
Here stringVectorSize tells PCA how many columns (principal components) to keep. I am trying to get the size of the sparse vector that the vector assembler outputs, but my code gives size = 1, which is wrong. What is the right code to get the size of a sparse vector stored in a DataFrame column?
To put it plainly:
+-------------+------------+-------------+------------+---+-----------+---------------+-----------------+--------------------+
|PetalLengthCm|PetalWidthCm|SepalLengthCm|SepalWidthCm| Id| Species|Species_Encoded| Id_Encoded| stringFeatures|
+-------------+------------+-------------+------------+---+-----------+---------------+-----------------+--------------------+
| 1.4| 0.2| 5.1| 3.5| 1|Iris-setosa| (2,[0],[1.0])| (149,[91],[1.0])|(151,[91,149],[1....|
| 1.4| 0.2| 4.9| 3.0| 2|Iris-setosa| (2,[0],[1.0])|(149,[119],[1.0])|(151,[119,149],[1...|
| 1.3| 0.2| 4.7| 3.2| 3|Iris-setosa| (2,[0],[1.0])|(149,[140],[1.0])|(151,[140,149],[1...|
For the DataFrame above, I want to extract the size of the stringFeatures sparse vector (which is 151).
Upvotes: 2
Views: 7503
Reputation: 18042
If you read the DataFrame documentation, you will notice that the head method returns a Row. Therefore, rather than obtaining your SparseVector's size, you are obtaining the Row's size, which is 1 because the Row holds a single selected column. To solve this, you have to extract the element stored in the Row:
val row = df.select("stringFeatures").head
val vector = row(0).asInstanceOf[SparseVector]
val size = vector.size
For instance:
import sqlContext.implicits._
import org.apache.spark.sql.functions.udf
import org.apache.spark.mllib.linalg.SparseVector
val df = sc.parallelize(Array(10,2,3,4)).toDF("n")
val pepe = udf((i: Int) => new SparseVector(i, Array(i-1), Array(i)))
val x = df.select(pepe(df("n")).as("n"))
x.show()
+---------------+
| n|
+---------------+
|(10,[9],[10.0])|
| (2,[1],[2.0])|
| (3,[2],[3.0])|
| (4,[3],[4.0])|
+---------------+
val y = x.select("n").head
y(0).asInstanceOf[SparseVector].size
res12: Int = 10
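As a rough sketch of the same idea applied to a VectorAssembler output, the snippet below reads the first row's vector with Row.getAs and returns its size. It assumes Spark 2.x or later, where the ml package uses org.apache.spark.ml.linalg.Vector (on 1.x the import would be org.apache.spark.mllib.linalg.Vector). The names VectorSizeSketch and vectorSize, the column names, and the toy data are made up for illustration and are not from the question.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector // assumption: Spark 2.x+
import org.apache.spark.sql.{DataFrame, SparkSession}

object VectorSizeSketch {
  // Hypothetical helper: length of the vector stored in the first row of `colName`.
  // Using the Vector trait covers both sparse and dense vectors.
  def vectorSize(df: DataFrame, colName: String): Int =
    df.select(colName).head.getAs[Vector](0).size

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("vector-size-sketch")
      .getOrCreate()
    import spark.implicits._

    // Toy data standing in for the real DataFrame.
    val raw = Seq((1.4, 0.2, 5.1), (1.3, 0.2, 4.7)).toDF("a", "b", "c")

    val assembled = new VectorAssembler()
      .setInputCols(Array("a", "b", "c"))
      .setOutputCol("features")
      .transform(raw)

    // One slot per numeric input column, so this prints 3.
    println(vectorSize(assembled, "features"))

    spark.stop()
  }
}
```

With the DataFrame from the question, the equivalent call would be vectorSize(df, "stringFeatures"), and the result could then be passed to setK.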
Upvotes: 3