Reputation: 2036
I am experiencing some very strange behaviour from VectorAssembler and I was wondering if anyone else has seen this.
My scenario is pretty straightforward. I parse data from a CSV file with some standard Int and Double fields, and I also calculate some extra columns. My parsing function returns this:
val joinedCounts = countPerChannel ++ countPerSource // two arrays of Doubles joined
(label, orderNo, pageNo, Vectors.dense(joinedCounts))
My main function uses the parsing function like this:
val parsedData = rawData.filter(row => row != header).map(parseLine)
val data = sqlContext.createDataFrame(parsedData).toDF("label", "orderNo", "pageNo", "joinedCounts")
I then use a VectorAssembler
like this:
val assembler = new VectorAssembler()
.setInputCols(Array("orderNo", "pageNo", "joinedCounts"))
.setOutputCol("features")
val assemblerData = assembler.transform(data)
So when I print a row of my data before it goes into the VectorAssembler
it looks like this:
[3.2,17.0,15.0,[0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0]]
After the transform function of VectorAssembler I print the same row of data and get this:
[3.2,(18,[0,1,6,9,14,17],[17.0,15.0,3.0,1.0,4.0,2.0])]
What on earth is going on? What has the VectorAssembler done? I've double-checked all the calculations and even followed the simple Spark examples, and I cannot see what is wrong with my code. Can you?
Upvotes: 15
Views: 5208
Reputation: 40370
There is nothing strange about the output. Your vector has lots of zero elements, so Spark used its sparse representation.
To explain further: your vector is composed of 18 elements (its dimension). The indices [0,1,6,9,14,17] of the vector hold the non-zero elements, which are, in order, [17.0,15.0,3.0,1.0,4.0,2.0].
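As a sanity check, you can reproduce that sparse triple from the dense row with a few lines of plain Scala (no Spark needed). The dense array below is the assembled vector from the question: orderNo, pageNo, then the 16 joinedCounts values.

```scala
// The dense assembled vector from the question (18 elements).
val dense = Array(17.0, 15.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0, 1.0,
                  0.0, 0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 2.0)

// A sparse vector stores only (size, indices of non-zeros, non-zero values).
// Collect the (index, value) pairs for every non-zero entry:
val (indices, values) = dense.zipWithIndex.collect {
  case (v, i) if v != 0.0 => (i, v)
}.unzip

println(indices.mkString("[", ",", "]"))  // [0,1,6,9,14,17]
println(values.mkString("[", ",", "]"))   // [17.0,15.0,3.0,1.0,4.0,2.0]
```

Those are exactly the two arrays you see printed inside (18,[...],[...]).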
The sparse representation saves memory by storing only the non-zero entries, which also makes many operations cheaper to compute. More on sparse representation here.
Of course you can convert that sparse representation back to a dense representation, but it comes at a memory cost.
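If you do want the dense form, Spark's vector types expose a toDense method you can call (e.g. inside a UDF). The expansion itself is just filling an all-zero array, as this plain-Scala sketch shows; the indices and values are the ones from the sparse row in the question:

```scala
// Rebuild the dense array from the sparse triple (size, indices, values):
// the same expansion Spark performs when converting a sparse vector to dense.
val size    = 18
val indices = Array(0, 1, 6, 9, 14, 17)
val values  = Array(17.0, 15.0, 3.0, 1.0, 4.0, 2.0)

val dense = Array.fill(size)(0.0)  // start from all zeros
indices.zip(values).foreach { case (i, v) => dense(i) = v }

println(dense.mkString("[", ",", "]"))
// [17.0,15.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,2.0]
```

Note that for 18 slots you now store 18 doubles instead of 6 index/value pairs, which is why Spark prefers the sparse form here.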
In case you are interested in getting feature importance, I advise you to take a look at this.
Upvotes: 20