Sparse vector to dataframe in pyspark

Question

I have a SparseVector in PySpark that looks like this:

SparseVector(5,{1:5,2:3,3:5,4:3,5:2})

How can I convert it to a pandas DataFrame with two columns?

I tried sparsevector.zipWithIndex(), but it did not work.

pault · Accepted Answer

Your example vector is malformed: you've specified a size of 5, so the valid indices are 0 through 4 and there cannot be an index 5. Once you fix that, you can call toArray(), which returns a numpy.ndarray, and pass the result to the pandas.DataFrame constructor.

from pyspark.mllib.linalg import SparseVector
# from pyspark.ml.linalg import SparseVector  # works the same way

import pandas as pd

a = SparseVector(5,{0:5,1:3,2:5,3:3,4:2})  # note the index starts at 0
df = pd.DataFrame(a.toArray())
print(df)
#     0
#0  5.0
#1  3.0
#2  5.0
#3  3.0
#4  2.0

The code works the same whether you're working with pyspark.mllib.linalg.SparseVector or pyspark.ml.linalg.SparseVector.
