Reputation: 1308
I use SparkML's GBTClassifier to train on a wide-feature dataset for a binary classification problem:
Xtrain.select(labelCol).groupBy(labelCol).count().orderBy(labelCol).show()
+-----+------+
|label| count|
+-----+------+
| 0|631608|
| 1| 18428|
+-----+------+
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import GBTClassifier

va = VectorAssembler(inputCols=col_header, outputCol="features")
tr = GBTClassifier(labelCol=labelCol, featuresCol="features", maxIter=30, maxDepth=5, seed=420)
pipeline = Pipeline(stages=[va, tr])
model = pipeline.fit(Xtrain)
The classifier runs very fast (unusually so) and learns with 100% accuracy; moreover, the testing set is also predicted with 100% accuracy. When I print
model.stages[1].featureImportances
SparseVector(29, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.0, 10: 0.0, 11: 0.0, 12: 0.0, 13: 0.0, 14: 0.0, 15: 0.0, 16: 0.0, 17: 0.0, 18: 0.0, 19: 0.0, 20: 0.0, 21: 0.0, 22: 0.0, 23: 0.0, 24: 1.0, 25: 0.0, 26: 0.0, 27: 0.0, 28: 0.0})
I notice that one feature (#24 in this case) in my DataFrame contributes 100% of the weight to the model. When I remove this field and retrain, I see the same picture; the only difference is that a second field now contributes alone to the model, and I again get 100% accuracy. Obviously something is not right here. What is it?
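For reference, VectorAssembler keeps the order of inputCols, so index 24 in the importance vector should correspond to col_header[24]; a quick sketch (under that assumption) to look up which column that is:

import numpy as np

# Sketch: map the dominant importance index back to its original column name.
# Assumes col_header is the same list passed to VectorAssembler(inputCols=...).
importances = model.stages[1].featureImportances   # SparseVector of per-feature importances
top_idx = int(np.argmax(importances.toArray()))    # dominant feature index (24 here)
print(top_idx, col_header[top_idx])                # name of the suspect column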
Upvotes: 1
Views: 796
Reputation: 330353
The most common cause of behavior like this on a non-degenerate dataset is data leakage. Data leakage can take different forms, but considering
that one feature (#24 in this case) in my DataFrame contributed 100% weight
we can significantly narrow things down.
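In practice, that usually means the dominant column is a proxy for, or derived directly from, the label itself. A minimal sketch of how to check this, assuming the suspect column is the one behind vector index 24 (suspect_col is a hypothetical name):

# Hypothetical leakage check: if suspect_col deterministically encodes the label,
# every distinct value of suspect_col will co-occur with exactly one label value.
suspect_col = col_header[24]  # column behind vector index 24 (assumed)
Xtrain.groupBy(suspect_col, labelCol).count().orderBy(suspect_col, labelCol).show()
# For a numeric column, a Pearson correlation close to +/-1 with the label is another red flag:
print(Xtrain.stat.corr(suspect_col, labelCol))

If every value of the suspect column maps to a single label value, a depth-5 tree can separate the classes perfectly from that one column, which is consistent with the importance vector and the instant 100% accuracy you observe.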
Upvotes: 1