Reputation: 1308
I use SparkML's GBTClassifier to train on a wide-feature dataset for a binary classification problem:
Xtrain.select(labelCol).groupBy(labelCol).count().orderBy(labelCol).show()
+-----+------+
|label| count|
+-----+------+
| 0|631608|
| 1| 18428|
+-----+------+
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier

va = VectorAssembler(inputCols=col_header, outputCol="features")
tr = GBTClassifier(labelCol=labelCol, featuresCol="features", maxIter=30, maxDepth=5, seed=420)
pipeline = Pipeline(stages=[va, tr])
model = pipeline.fit(Xtrain)
The classifier runs very fast (unusually so) and learns with 100% accuracy; moreover, the testing set is also predicted with 100% accuracy. When I print
model.stages[1].featureImportances
SparseVector(29, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0, 13: 0.0, 14: 0.0, 15: 0.0, 16: 0.0, 17: 0.0, 18: 0.0, 19: 0.0, 20: 0.0, 21: 0.0, 22: 0.0, 23: 0.0, 24: 1.0, 25: 0.0, 26: 0.0, 27: 0.0, 28: 0.0})
I notice that one feature (#24 in this case) in my DataFrame contributes 100% of the weight to the model. When I remove this field and retrain, I see the same picture; the only difference is that a second field now contributes alone to the model, and I again get 100% accuracy. Obviously something is not right here. What is it?
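For reference, VectorAssembler keeps the order of inputCols, so index 24 in the importance vector should correspond to col_header[24]; a quick sketch (under that assumption) to look up which column that is:

import numpy as np

# Sketch: map the dominant importance index back to its original column name.
# Assumes col_header is the same list passed to VectorAssembler(inputCols=...).
importances = model.stages[1].featureImportances   # SparseVector of per-feature importances
top_idx = int(np.argmax(importances.toArray()))    # dominant feature index (24 here)
print(top_idx, col_header[top_idx])                # name of the suspect column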
Upvotes: 1
Views: 796
Reputation: 330353
The most common cause of behavior like this on a non-degenerate dataset is data leakage. Data leakage can take different forms, but considering
that one feature (#24 in this case) in my DataFrame contributed 100% weight
we can significantly narrow things down.
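In practice, that usually means the dominant column is a proxy for, or derived directly from, the label itself. A minimal sketch of how to check this, assuming the suspect column is the one behind vector index 24 (suspect_col is a hypothetical name):

# Hypothetical leakage check: if suspect_col deterministically encodes the label,
# every distinct value of suspect_col will co-occur with exactly one label value.
suspect_col = col_header[24]  # column behind vector index 24 (assumed)
Xtrain.groupBy(suspect_col, labelCol).count().orderBy(suspect_col, labelCol).show()
# For a numeric column, a Pearson correlation close to +/-1 with the label is another red flag:
print(Xtrain.stat.corr(suspect_col, labelCol))

If every value of the suspect column maps to a single label value, a depth-5 tree can separate the classes perfectly from that one column, which is consistent with the importance vector and the instant 100% accuracy you observe.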
Upvotes: 1