Cannot create dataframe from list: pyspark

Question

I have a list that is generated by a function. When I print the list:

print(preds_labels)

I obtain:

[(0.,8.),(0.,13.),(0.,19.),(0.,19.),(0.,19.),(0.,19.),(0.,19.),(0.,20.),(0.,21.),(0.,23.)]

But when I try to create a DataFrame with this command:

df = sqlContext.createDataFrame(preds_labels, ["prediction", "label"])

I get an error message:

not supported type: type 'numpy.float64'

If I create the list manually, I have no problem. Do you have an idea?

shuaiyuancn · Accepted Answer

PySpark uses its own type system, and unfortunately it doesn't handle NumPy types well. It does work with built-in Python types, though, so you can manually convert each numpy.float64 to float, like this:

# Convert each numpy.float64 to a built-in Python float before passing the rows to Spark
df = sqlContext.createDataFrame(
    [(float(tup[0]), float(tup[1])) for tup in preds_labels],
    ["prediction", "label"]
)

Note that PySpark will then infer both columns as pyspark.sql.types.DoubleType.
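If the rows have more than two columns, the same conversion can be written as a small helper. This is only a sketch: the name to_python_floats is hypothetical (not part of PySpark or NumPy), and it assumes every value in every row can be passed to float().

```python
def to_python_floats(rows):
    """Return a new list of tuples with every value cast to a built-in float.

    Hypothetical helper for illustration; assumes each value supports float().
    """
    return [tuple(float(value) for value in row) for row in rows]


# Usage, assuming `preds_labels` and `sqlContext` exist as in the question:
# df = sqlContext.createDataFrame(
#     to_python_floats(preds_labels),
#     ["prediction", "label"],
# )
```

The helper only touches the Python side, so it can be checked without a Spark session before the rows are handed to createDataFrame.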
