Reputation: 3811
I have a dataframe called training, shown below:
+------------------+------+
| features| MEDV|
+------------------+------+
| [6.575,4.98,15.3]|504000|
| [6.421,9.14,17.8]|453600|
| [7.185,4.03,17.8]|728700|
| [6.998,2.94,18.7]|701400|
+------------------+------+
I run a linear regression on this dataset:

from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol='features', predictionCol='predictions')
lrModel = lr.fit(training)
Error:
Py4JJavaError: An error occurred while calling o51.fit.
: java.lang.IllegalArgumentException: label does not exist. Available: features, MEDV
at org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:275)
at scala.collection.immutable.Map$Map2.getOrElse(Map.scala:147)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:274)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:75)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema(Predictor.scala:53)
at org.apache.spark.ml.PredictorParams.validateAndTransformSchema$(Predictor.scala:46)
at org.apache.spark.ml.regression.LinearRegression.org$apache$spark$ml$regression$LinearRegressionParams$$super$validateAndTransformSchema(LinearRegression.scala:176)
at org.apache.spark.ml.regression.LinearRegressionParams.validateAndTransformSchema(LinearRegression.scala:119)
at org.apache.spark.ml.regression.LinearRegressionParams.validateAndTransformSchema$(LinearRegression.scala:107)
at org.apache.spark.ml.regression.LinearRegression.validateAndTransformSchema(LinearRegression.scala:176)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:178)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:75)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:134)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:116)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
What is this label that doesn't exist?
Upvotes: 0
Views: 726
Reputation: 14845
The parameter name for the label column is labelCol. Its default value is label, which is why Spark tries to read a column called label that does not exist in your dataframe.

Passing labelCol='MEDV' should fix the problem. (predictionCol='predictions' is not the cause of the error; it only names the output column that will hold the predictions.)
Here is the link to the API docs.
Upvotes: 1