sheIsTrue

Reputation: 73

How to fix: pyspark.sql.utils.IllegalArgumentException: incorrect type for Column features?

I am new to pyspark and am trying to run the simple code below.

from pyspark.mllib.util import MLUtils
from pyspark.ml.classification import DecisionTreeClassifier

# create an RDD of LabeledPoint
bcData = MLUtils.loadLibSVMFile(sc, "breast-cancer.txt")

# convert it to DataFrame
bcDataFrame = ss.createDataFrame(bcData)
bcDataFrame.cache()

# split the data
(training_data, testing_data) = bcDataFrame.randomSplit([0.8, 0.2])

# create the model
dt_classifier = DecisionTreeClassifier(impurity="gini", maxDepth=2, labelCol="label", featuresCol="features")
dt_model = dt_classifier.fit(training_data)

When running, I get the following error at the last line.

pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column features must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.'

I am not sure why I am getting this error when the actual type of the column "features" appears to match the expected type exactly.

Upvotes: 3

Views: 6773

Answers (2)

I had the same problem working in the following environment: Databricks, Spark 2.4.0, Scala 2.11

In my case the error was caused by importing the wrong package. Originally I had:

import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.linalg.Vectors

The problem was the second import (it pulls in the wrong Vectors class). The solution was to change it to:

import org.apache.spark.ml.linalg.Vectors

and voila!

Hope this gives you some clues about fixing it in python.

Upvotes: 4

Alex Chang

Reputation: 41

I suspect the root cause is that you are importing from both ml and mllib. I once got a similar message when I imported Vectors, SparseVector and VectorUDT, some from ml and some from mllib. After importing them all from ml only, the error message was gone.

Upvotes: 0
