Dimon Buzz

Reputation: 1308

Spark GBTClassifier always predicts with 100% accuracy

I use the Spark ML GBTClassifier to train on a wide-feature dataset for a binary classification problem:

Xtrain.select(labelCol).groupBy(labelCol).count().orderBy(labelCol).show()
+-----+------+
|label| count|
+-----+------+
|    0|631608|
|    1| 18428|
+-----+------+

from pyspark.ml import Pipeline
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.feature import VectorAssembler

va = VectorAssembler(inputCols=col_header, outputCol="features")
tr = GBTClassifier(labelCol=labelCol, featuresCol="features", maxIter=30, maxDepth=5, seed=420)
pipeline = Pipeline(stages=[va, tr])
model = pipeline.fit(Xtrain)

The classifier trains unusually fast and reaches 100% accuracy; moreover, the test set is also predicted with 100% accuracy. When I print

model.stages[1].featureImportances
SparseVector(29, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0, 13: 0.0, 14: 0.0, 15: 0.0, 16: 0.0, 17: 0.0, 18: 0.0, 19: 0.0, 20: 0.0, 21: 0.0, 22: 0.0, 23: 0.0, 24: 1.0, 25: 0.0, 26: 0.0, 27: 0.0, 28: 0.0})

I notice that one feature (#24 in this case) in my DataFrame contributes 100% of the weight to the model. When I remove this field and retrain, I see the same picture; the only difference is that a second field now carries all the weight by itself, and I again get 100% accuracy. Obviously something is not right here. What is it?
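
A minimal sketch of that retrain experiment, reusing the names defined above (dropping index 24 assumes every assembler input is a single scalar column, so vector index i corresponds to col_header[i]):

# Drop the top-importance column and fit the same pipeline again.
reduced_cols = [c for i, c in enumerate(col_header) if i != 24]
va2 = VectorAssembler(inputCols=reduced_cols, outputCol="features")
model2 = Pipeline(stages=[va2, tr]).fit(Xtrain)
print(model2.stages[1].featureImportances)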

Upvotes: 1

Views: 796

Answers (1)

zero323

Reputation: 330353

The most common cause of behavior like this on a non-degenerate dataset is data leakage. Data leakage can take different forms, but considering

that one feature (#24 in this case) in my DataFrame contributes 100% of the weight to the model

we can significantly narrow things down:

  • A simple coding mistake: you've included the label (or a transformed label) among the features. Double-check your processing pipeline.
  • The original data contains features that were used to derive the label, or that were themselves derived from the label. Check the data dictionary (if present) or other available sources to determine which features should be discarded from your model; in general, look for anything you wouldn't expect to see in raw data. (Both checks are sketched below.)
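
As a minimal sketch of both checks, assuming the Xtrain, labelCol, and col_header names from the question, and that every assembler input is a scalar column so importance index 24 maps to col_header[24]:

from pyspark.sql import functions as F

# 1. The label must never reach the VectorAssembler.
assert labelCol not in col_header, "label leaked into the feature list"

# 2. Map the suspicious importance index back to a DataFrame column.
suspect_col = col_header[24]

# 3. Summarize the suspect column per label; a feature that separates
#    the classes perfectly is a strong sign of leakage.
Xtrain.groupBy(labelCol).agg(
    F.min(suspect_col).alias("min"),
    F.max(suspect_col).alias("max"),
    F.avg(suspect_col).alias("mean"),
).orderBy(labelCol).show()

If the value ranges for label 0 and label 1 don't overlap at all, a single tree split on that column reproduces the label exactly, which is consistent with the single 1.0 importance (and the unusually fast training) you observed.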

Upvotes: 1
