Rajarshi Bhadra

Reputation: 1944

Sparse vector to dataframe in pyspark

I have a SparseVector in PySpark that looks like this:

SparseVector(5,{1:5,2:3,3:5,4:3,5:2})

How can I convert it to a pandas DataFrame with two columns that looks like this?

ID VALUE
1   5
2   3
3   5
4   3
5   2

I tried sparsevector.zipWithIndex(), but it did not work.

Upvotes: 1

Views: 5643

Answers (1)

pault

Reputation: 43504

Your example vector is malformed: you declared a size of 5, so the valid indices are 0 through 4 and there cannot be an index 5. Once that is fixed, you can call toArray(), which returns a numpy.ndarray, and pass the result to the pandas.DataFrame constructor.

from pyspark.mllib.linalg import SparseVector
# from pyspark.ml.linalg import SparseVector  # alternative import; the rest is unchanged

import pandas as pd

a = SparseVector(5, {0: 5, 1: 3, 2: 5, 3: 3, 4: 2})  # indices are 0-based: 0 through 4
df = pd.DataFrame(a.toArray())
print(df)
#     0
#0  5.0
#1  3.0
#2  5.0
#3  3.0
#4  2.0

The code works the same whether you're working with pyspark.mllib.linalg.SparseVector or pyspark.ml.linalg.SparseVector.
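The output above has a single unnamed column, whereas the question asks for ID and VALUE columns. Here is a minimal sketch of one way to get that shape, assuming you build the frame from the vector's indices and values attributes (the stored entries) instead of from toArray(). The helper name sparse_to_frame and its one_based flag are made up for illustration, and literal lists stand in for a.indices and a.values so the snippet runs without a Spark installation.

```python
import numpy as np
import pandas as pd


def sparse_to_frame(indices, values, one_based=False):
    """Build an ID/VALUE DataFrame from a sparse vector's stored entries.

    Hypothetical helper: `indices` and `values` are meant to be the
    `.indices` and `.values` arrays of a pyspark SparseVector.
    """
    ids = np.asarray(indices, dtype=int)
    if one_based:
        # Shift the 0-based vector indices to the 1-based IDs in the question.
        ids = ids + 1
    return pd.DataFrame({"ID": ids, "VALUE": np.asarray(values, dtype=float)})


# With a real vector `a` you would pass a.indices and a.values here;
# these literals mirror the corrected example vector.
df = sparse_to_frame([0, 1, 2, 3, 4], [5.0, 3.0, 5.0, 3.0, 2.0], one_based=True)
print(df)
```

Note that this approach lists only the entries the vector actually stores, so implicit zeros are skipped. If you want one row per position, zeros included, start from toArray() and add the ID column from the array's positions.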

Upvotes: 3
