Reputation: 1593
This is my first time with PySpark (Spark 2), and I'm trying to create a toy DataFrame for a logit model. I ran the tutorial successfully and would now like to pass my own data into it.
I've tried this:
%pyspark
import numpy as np
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.mllib.regression import LabeledPoint
df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)
mydf = spark.createDataFrame(df,["label", "features"])
but I cannot get rid of:
TypeError: Cannot convert type <class 'pyspark.ml.linalg.DenseVector'> into Vector
I'm using the ML library for vectors, and the input is a double array, so what's the catch? According to the documentation it should be fine.
Many thanks.
Upvotes: 18
Views: 27605
Reputation: 2133
From NumPy to Pandas to Spark:
import numpy as np
import pandas as pd

data = np.random.rand(4,4)
df = pd.DataFrame(data, columns=list('abcd'))
spark.createDataFrame(df).show()
Output:
+-------------------+-------------------+------------------+-------------------+
| a| b| c| d|
+-------------------+-------------------+------------------+-------------------+
| 0.8026427193838694|0.16867056812634307|0.2284873209015007|0.17141853164400833|
| 0.2559088794287595| 0.3896957084615589|0.3806810025185623| 0.9362280141470332|
|0.41313827425060257| 0.8087580640179158|0.5547653674054028| 0.5386190454838264|
| 0.2948395900484454| 0.4085807623354264|0.6814694724946697|0.32031773805256325|
+-------------------+-------------------+------------------+-------------------+
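If the goal is to feed these columns into an ML estimator, they still need to be assembled into a single vector column. A minimal sketch using spark.ml's VectorAssembler, reusing the df from above (the column choice and output name are illustrative, not part of the original answer):

from pyspark.ml.feature import VectorAssembler

sdf = spark.createDataFrame(df)
# combine the four numeric columns into one "features" vector column
assembler = VectorAssembler(inputCols=['a', 'b', 'c', 'd'], outputCol='features')
assembler.transform(sdf).select('features').show(truncate=False)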
Upvotes: 12
Reputation: 60400
You are mixing functionality from ML and MLlib, which are not necessarily compatible. You don't need a LabeledPoint when using spark-ml:
sc.version
# u'2.1.1'
import numpy as np
from pyspark.ml.linalg import Vectors
df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
dff = map(lambda x: (int(x[0]), Vectors.dense(x[1:])), df)
mydf = spark.createDataFrame(dff,schema=["label", "features"])
mydf.show(5)
# +-----+-------------+
# |label| features|
# +-----+-------------+
# | 1|[0.0,0.0,0.0]|
# | 0|[0.0,1.0,1.0]|
# | 0|[0.0,1.0,0.0]|
# | 1|[0.0,0.0,1.0]|
# | 0|[0.0,1.0,0.0]|
# +-----+-------------+
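Since the question is ultimately about a logit model, here is a minimal sketch of the next step with spark.ml, assuming the mydf built above (the training call is not part of the original answer):

from pyspark.ml.classification import LogisticRegression

# fit a logistic regression on the (label, features) DataFrame;
# "label" and "features" are the default column names spark.ml looks for
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
model = lr.fit(mydf)
print(model.coefficients)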
PS: As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package. [ref.]
Upvotes: 8
Reputation: 2392
The problem is easy to solve. You're using the ml and the mllib APIs at the same time. Stick to one; otherwise you get this error. This is the solution for the mllib API:
import numpy as np
from pyspark.mllib.linalg import Vectors, VectorUDT  # note: mllib, not ml
from pyspark.mllib.regression import LabeledPoint
df = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
df = map(lambda x: LabeledPoint(x[0], Vectors.dense(x[1:])), df)
mydf = spark.createDataFrame(df, ["label", "features"])
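To actually train with the RDD-based mllib API, note that its estimators take an RDD of LabeledPoint rather than a DataFrame. A hedged sketch of that step (not part of the original answer; names are illustrative):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

# mllib trains on RDDs of LabeledPoint, not on DataFrames
data = np.concatenate([np.random.randint(0,2, size=(1000)), np.random.randn(1000), 3*np.random.randn(1000)+2, 6*np.random.randn(1000)-2]).reshape(1000,-1)
points = sc.parallelize([LabeledPoint(x[0], Vectors.dense(x[1:])) for x in data])
model = LogisticRegressionWithLBFGS.train(points)
print(model.weights)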
For the ml API, you don't really need LabeledPoint anymore. Here is an example. I would suggest using the ml API, since the mllib API is going to be deprecated soon.
Upvotes: 2