Reputation: 73
I am new to PySpark and am trying to run the simple code below.
from pyspark.mllib.util import MLUtils
from pyspark.ml.classification import DecisionTreeClassifier

# create an RDD of LabeledPoint
bcData = MLUtils.loadLibSVMFile(sc, "breast-cancer.txt")
# convert it to a DataFrame
bcDataFrame = ss.createDataFrame(bcData)
bcDataFrame.cache()
# split the data
(training_data, testing_data) = bcDataFrame.randomSplit([0.8, 0.2])
# create the model
dt_classifier = DecisionTreeClassifier(impurity="gini", maxDepth=2, labelCol="label", featuresCol="features")
dt_model = dt_classifier.fit(training_data)
When I run this, I get the following error on the last line:
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'
I am not sure why I am getting this error when the actual type of the "features" column appears to match the expected type exactly.
Upvotes: 3
Views: 6773
Reputation: 804
I had the same problem working in the following environment: Databricks, Spark 2.4.0, Scala 2.11
In my case the error was caused by importing the wrong packages. My incorrect imports were:
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
The culprit was the second import, which pulls in the wrong Vectors class. The solution was to change it to:
import org.apache.spark.ml.linalg.Vectors
and voila!
Hope this gives you some clues about fixing it in Python.
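In PySpark the same mismatch happens: MLUtils.loadLibSVMFile returns mllib vectors, while DecisionTreeClassifier (from pyspark.ml) expects ml vectors. Here is a minimal sketch of two possible fixes, reusing the question's ss/sc handles (an adaptation of the Scala fix above, not tested against the asker's data):

# Option 1: load with the DataFrame-based libsvm reader,
# which produces an ml-vector "features" column directly
bcDataFrame = ss.read.format("libsvm").load("breast-cancer.txt")

# Option 2: keep the RDD-based loader, then convert the
# mllib vector column in the DataFrame to ml vectors
from pyspark.mllib.util import MLUtils
bcData = MLUtils.loadLibSVMFile(sc, "breast-cancer.txt")
bcDataFrame = MLUtils.convertVectorColumnsToML(ss.createDataFrame(bcData))

Either DataFrame can then be passed to DecisionTreeClassifier.fit exactly as in the question.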
Upvotes: 4
Reputation: 41
I suspect the root cause is that you are importing from both ml and mllib. I once got a similar message after importing Vectors, SparseVector, and VectorUDT, some from ml and some from mllib. Once I imported all of them from ml only, the error message was gone.
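To make the distinction concrete, a small Python sketch of consistent imports (the class names in ml and mllib are identical, which is what makes this easy to get wrong):

# DataFrame-based API: use these together with pyspark.ml estimators
from pyspark.ml.linalg import Vectors, SparseVector, VectorUDT
from pyspark.ml.classification import DecisionTreeClassifier

# RDD-based API: do not mix these into an ml pipeline
# from pyspark.mllib.linalg import Vectors, SparseVector, VectorUDT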
Upvotes: 0