Jan Sila

Reputation: 1593

Creating Spark dataframe from numpy matrix

It is my first time with PySpark (Spark 2), and I'm trying to create a toy dataframe for a Logit model. I ran the tutorial successfully and would like to pass my own data into it.

I've tried this:

%pyspark
import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(df,["label", "features"])

but I cannot get rid of :

TypeError: Cannot convert type <class 'pyspark.ml.linalg.DenseVector'> into Vector

I'm using the ML library's Vectors and the input is a double array, so what's the catch? It should be fine according to the documentation.

Many thanks.

Upvotes: 18

Views: 27605

Answers (3)

Jeff Hernandez

Reputation: 2133

From Numpy to Pandas to Spark:

import numpy as np
import pandas as pd

data = np.random.rand(4, 4)
df = pd.DataFrame(data, columns=list('abcd'))
spark.createDataFrame(df).show()

Output:

+-------------------+-------------------+------------------+-------------------+
|                  a|                  b|                 c|                  d|
+-------------------+-------------------+------------------+-------------------+
| 0.8026427193838694|0.16867056812634307|0.2284873209015007|0.17141853164400833|
| 0.2559088794287595| 0.3896957084615589|0.3806810025185623| 0.9362280141470332|
|0.41313827425060257| 0.8087580640179158|0.5547653674054028| 0.5386190454838264|
| 0.2948395900484454| 0.4085807623354264|0.6814694724946697|0.32031773805256325|
+-------------------+-------------------+------------------+-------------------+
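If you'd rather skip the Pandas step, spark.createDataFrame also accepts a plain list of rows; a minimal sketch (converting the NumPy rows to Python lists first):

data = np.random.rand(4, 4)
# tolist() turns NumPy scalars into plain Python floats, which Spark can infer
spark.createDataFrame(data.tolist(), list('abcd')).show()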

Upvotes: 12

desertnaut

Reputation: 60400

You are mixing functionality from ML and MLlib, which are not necessarily compatible. You don't need a LabeledPoint when using spark-ml:

sc.version
# u'2.1.1'

import numpy as np
from pyspark.ml.linalg import Vectors

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
dff = map(lambda x: (int(x[0]), Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(dff,schema=["label", "features"])

mydf.show(5)
# +-----+-------------+ 
# |label|     features| 
# +-----+-------------+ 
# |    1|[0.0,0.0,0.0]| 
# |    0|[0.0,1.0,1.0]| 
# |    0|[0.0,1.0,0.0]| 
# |    1|[0.0,0.0,1.0]| 
# |    0|[0.0,1.0,0.0]|
# +-----+-------------+

PS: As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. [ref.]
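Since the goal is a Logit model, here is a minimal sketch of fitting spark-ml's LogisticRegression on the mydf built above (maxIter=10 is just an arbitrary choice):

from pyspark.ml.classification import LogisticRegression

# LogisticRegression's default column names ("label", "features") match mydf
lr = LogisticRegression(maxIter=10)
model = lr.fit(mydf)
model.transform(mydf).select("label", "prediction").show(5)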

Upvotes: 8

Dat Tran

Reputation: 2392

The problem is easy to solve: you're using the ml and the mllib APIs at the same time. Stick to one; otherwise you get this error.

This is the solution for the mllib API:

import numpy as np
from pyspark.mllib.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint

df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)

mydf = spark.createDataFrame(df,["label", "features"])

For the ml API, you don't really need LabeledPoint anymore. Here is an example. I would suggest using the ml API, since the mllib API will be deprecated soon.
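For completeness, a minimal sketch of that ml-style construction (mirroring the answer above, reusing the same random matrix df):

from pyspark.ml.linalg import Vectors

# Plain (label, features-vector) tuples replace LabeledPoint in the ml API
dff = map(lambda x: (float(x[0]), Vectors.dense(x[1:])), df)
mydf = spark.createDataFrame(dff, schema=["label", "features"])
mydf.show(5)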

Upvotes: 2
