Larissa Leite

Reputation: 1368

org.apache.spark.SparkException: Unseen label with TrainValidationSplit

I was searching for this error but I haven't found anything related to TrainValidationSplit. I want to do parameter tuning, and doing so with TrainValidationSplit raises the following error: org.apache.spark.SparkException: Unseen label.

I understand why this happens, and increasing the trainRatio mitigates the problem but does not completely solve it. Here is (part of) the code:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

stages = []
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol+"Index")
    stages += [stringIndexer]

assemblerInputs = [x+"Index" for x in categoricalCols] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel')
stages += [labelIndexer]

dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
stages += [dt]

evaluator = MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1')

paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1,2,6])
             .addGrid(dt.maxBins, [20,40])
             .build())

pipeline = Pipeline(stages=stages)

trainValidationSplit = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=evaluator, trainRatio=0.95)

model = trainValidationSplit.fit(train_dataset)
train_dataset = model.transform(train_dataset)

I have seen this answer, but I am not sure whether it also applies to my case, and I am wondering if there is a more appropriate solution. Please help.

Upvotes: 0

Views: 1376

Answers (1)

Ida

Reputation: 2999

The Unseen label exception is usually associated with StringIndexer.

You split the data into a training (95%) and a validation (5%) dataset. Most likely there are some category values (in the categoricalCols columns) that appear in the validation set but do not appear in the training set.

Therefore, during the string indexing stage of validation, the StringIndexer encounters an unseen label and throws that exception. By increasing the training ratio, you increase the chance that the category values in the training set are a superset of those in the validation set, but this is only a workaround, since there is no guarantee.

One possible solution: fit the StringIndexer with train_dataset first, and add the resulting StringIndexerModel to the pipeline stages. This way the StringIndexer would see all the possible category values.

stages = []
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    # Fit on the full training dataset so the indexer sees every category value,
    # then put the fitted StringIndexerModel (not the unfitted estimator) into the pipeline.
    strIndexModel = stringIndexer.fit(train_dataset)
    stages += [strIndexModel]

Upvotes: 2
