Reputation: 1368
I was searching for this error but I haven't found anything related to TrainValidationSplit. I want to do parameter tuning, and doing so with TrainValidationSplit gives the following error: org.apache.spark.SparkException: Unseen label.
I understand why this happens, and increasing the trainRatio mitigates the problem but does not completely solve it.
This is (part of) the code:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

stages = []

# Index every categorical column
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    stages += [stringIndexer]

# Assemble the indexed categorical columns and the numeric columns into one feature vector
assemblerInputs = [x + "Index" for x in categoricalCols] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Index the label column
labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel')
stages += [labelIndexer]

dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
stages += [dt]

evaluator = MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1')

paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1, 2, 6])
             .addGrid(dt.maxBins, [20, 40])
             .build())

pipeline = Pipeline(stages=stages)
trainValidationSplit = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid,
                                            evaluator=evaluator, trainRatio=0.95)
model = trainValidationSplit.fit(train_dataset)
train_dataset = model.transform(train_dataset)
I have seen this answer but I am not sure whether it also applies to my case, and I am wondering if there is a more appropriate solution. Can anyone help?
Upvotes: 0
Views: 1376
Reputation: 2999
The Unseen label exception is usually associated with StringIndexer.
You split the data into a training (95%) and a validation (5%) set. Most likely there are some category values (in the categoricalCols columns) that appear in the validation set but do not appear in the training set.
During the string-indexing stage of validation, the StringIndexer, which was fitted only on the training split, encounters a label it has never seen and throws that exception. By increasing the training ratio you increase the chance that the category values in the training set are a superset of those in the validation set, but this is only a workaround, since there is no guarantee.
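To see the mechanism in isolation, here is a minimal sketch that reproduces the error outside of TrainValidationSplit. It assumes an existing SparkSession named spark, and the column name and values are made up for illustration:

from pyspark.ml.feature import StringIndexer

# Toy data: the "training" part only contains the values a and b,
# while the full data also contains c.
train_df = spark.createDataFrame([("a",), ("b",)], ["colour"])
full_df = spark.createDataFrame([("a",), ("b",), ("c",)], ["colour"])

# Fit the indexer on the smaller DataFrame only
indexer_model = StringIndexer(inputCol="colour", outputCol="colourIndex").fit(train_df)

# Transforming rows containing the value "c", which the model never saw during fit,
# fails with org.apache.spark.SparkException: Unseen label: c
indexer_model.transform(full_df).show()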
One possible solution: fit the StringIndexer on train_dataset first, and add the resulting StringIndexerModel to the pipeline stages. This way the StringIndexer has seen all the possible category values.
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    # Fit on the full training DataFrame so every category value is known to the indexer
    strIndexModel = stringIndexer.fit(train_dataset)
    stages += [strIndexModel]
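The rest of the pipeline can stay as it is. As a sketch, reusing the variable names from your question, the pre-fitted StringIndexerModels simply take the place of the unfitted indexers among the stages (a Pipeline accepts already-fitted Transformers as stages):

# The remaining stages are unchanged: the pre-fitted StringIndexerModels are
# followed by the assembler, the label indexer and the classifier.
stages += [assembler, labelIndexer, dt]

pipeline = Pipeline(stages=stages)
trainValidationSplit = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid,
                                            evaluator=evaluator, trainRatio=0.95)
model = trainValidationSplit.fit(train_dataset)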
Upvotes: 2