Reputation:
I applied the PySpark TF-IDF functions and got back the following results.
| features |
|----------|
| (35,[7,9,11,12,19,26,33],[1.2039728043259361,1.2039728043259361,1.2039728043259361,1.6094379124341003,1.6094379124341003,1.6094379124341003,1.6094379124341003]) |
| (35,[0,2,4,5,6,11,22],[0.9162907318741551,0.9162907318741551,1.2039728043259361,1.2039728043259361,1.2039728043259361,1.2039728043259361,1.6094379124341003]) |
So it is a dataframe with one column (features) whose rows are SparseVectors.
Now I want to build an IndexedRowMatrix from this dataframe so that I can run the SVD function described here: https://spark.apache.org/docs/latest/api/python/pyspark.mllib.html?highlight=svd#pyspark.mllib.linalg.distributed.IndexedRowMatrix.computeSVD
I tried the following, but it didn't work:
mat = RowMatrix(tfidfData.rdd.map(lambda x: x.features))
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
I used RowMatrix because constructing it doesn't require tuples, but I can't even build a RowMatrix. An IndexedRowMatrix will be even more difficult for me.
So how do I build an IndexedRowMatrix from the output of the TF-IDF dataframe in PySpark?
Upvotes: 2
Views: 2345
Reputation: 56
Please excuse me for not commenting on the original answer; I don't have the requisite reputation points yet. To speed things up, it would be beneficial to create an mllib.linalg.SparseVector rather than a dense one. It's really straightforward with the following modification:
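from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
# rawFeatures is the vector column; use your own column name (features in the question)
mat = RowMatrix(df.rdd.map(lambda v: Vectors.fromML(v.rawFeatures)))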
Upvotes: 1
Reputation:
I was able to solve it.
As the error suggests, RowMatrix won't accept a pyspark.ml.linalg.SparseVector, so I converted the vector into a pyspark.mllib.linalg one. Pay attention to the difference between ml and mllib. The following snippet converts the TF-IDF output to a RowMatrix, on which you can then call the computeSVD method.
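from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
# rawFeatures is the vector column; use your own column name (features in the question)
mat = RowMatrix(df.rdd.map(lambda v: Vectors.dense(v.rawFeatures.toArray())))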
I converted each row to a dense vector here, but with a few extra lines of code you can convert the ml.linalg.SparseVector into an mllib.linalg.SparseVector instead.
Upvotes: 3