Reputation: 1368
I was searching for this error but I haven't found anything related to TrainValidationSplit. I want to do parameter tuning, and doing so with TrainValidationSplit gives the following error: org.apache.spark.SparkException: Unseen label.
I understand why this happens, and increasing the trainRatio mitigates the problem but does not completely solve it.
This is (part of) the code:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

stages = []

# Index every categorical column
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    stages += [stringIndexer]

# Assemble the indexed categorical columns and the numeric columns into one feature vector
assemblerInputs = [x + "Index" for x in categoricalCols] + numericCols
assembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [assembler]

# Index the label column
labelIndexer = StringIndexer(inputCol='label', outputCol='indexedLabel')
stages += [labelIndexer]

dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="features")
stages += [dt]

evaluator = MulticlassClassificationEvaluator(labelCol='indexedLabel', predictionCol='prediction', metricName='f1')

paramGrid = (ParamGridBuilder()
             .addGrid(dt.maxDepth, [1, 2, 6])
             .addGrid(dt.maxBins, [20, 40])
             .build())

pipeline = Pipeline(stages=stages)
trainValidationSplit = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid,
                                            evaluator=evaluator, trainRatio=0.95)
model = trainValidationSplit.fit(train_dataset)
train_dataset = model.transform(train_dataset)
I have seen this answer but I am not sure whether it also applies to my case, and I am wondering if there is a more appropriate solution. Can anyone help?
Upvotes: 0
Views: 1376
Reputation: 2999
The Unseen label exception is usually associated with StringIndexer.
You split the data into a training (95%) and a validation (5%) set. Most likely there are some category values (in the categoricalCols columns) that appear in the validation set but do not appear in the training set.
During the string-indexing stage of validation, the StringIndexer, which was fitted only on the training split, encounters a label it has never seen and throws that exception. By increasing the training ratio you increase the chance that the category values in the training set are a superset of those in the validation set, but this is only a workaround, since there is no guarantee.
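To see the mechanism in isolation, here is a minimal sketch that reproduces the error outside of TrainValidationSplit. It assumes an existing SparkSession named spark, and the column name and values are made up for illustration:

from pyspark.ml.feature import StringIndexer

# Toy data: the "training" part only contains the values a and b,
# while the full data also contains c.
train_df = spark.createDataFrame([("a",), ("b",)], ["colour"])
full_df = spark.createDataFrame([("a",), ("b",), ("c",)], ["colour"])

# Fit the indexer on the smaller DataFrame only
indexer_model = StringIndexer(inputCol="colour", outputCol="colourIndex").fit(train_df)

# Transforming rows containing the value "c", which the model never saw during fit,
# fails with org.apache.spark.SparkException: Unseen label: c
indexer_model.transform(full_df).show()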
One possible solution: fit the StringIndexer on train_dataset first, and add the resulting StringIndexerModel to the pipeline stages. This way the StringIndexer has seen all the possible category values.
for categoricalCol in categoricalCols:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + "Index")
    # Fit on the full training DataFrame so every category value is known to the indexer
    strIndexModel = stringIndexer.fit(train_dataset)
    stages += [strIndexModel]
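The rest of the pipeline can stay as it is. As a sketch, reusing the variable names from your question, the pre-fitted StringIndexerModels simply take the place of the unfitted indexers among the stages (a Pipeline accepts already-fitted Transformers as stages):

# The remaining stages are unchanged: the pre-fitted StringIndexerModels are
# followed by the assembler, the label indexer and the classifier.
stages += [assembler, labelIndexer, dt]

pipeline = Pipeline(stages=stages)
trainValidationSplit = TrainValidationSplit(estimator=pipeline, estimatorParamMaps=paramGrid,
                                            evaluator=evaluator, trainRatio=0.95)
model = trainValidationSplit.fit(train_dataset)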
Upvotes: 2