Reputation: 123
I am training a GradientBoostingClassifier (GBC). It is a multi-class classifier with 14 output classes (labelled 0–13). My issue is that I am not getting 100% accuracy when I predict on the training data. In fact, the mispredictions happen on the dominant classes. (My input is imbalanced, so I do synthetic data creation.)
Here are the details. Input shape: (20744, 13) (I apply label encoding and min-max scaling to both the output and the input.)
Class distribution before oversampling:
[(0, 443), **(1, 6878)**, (2, 177), (3, 1255), (4, 311), (5, 172), (6, 1029), (7, 268), (8, 131), (9, 54), (10, 1159), (11, 340), (12, 1370), **(13, 7157)**]
Class distribution after oversampling with RandomOverSampler:
[(0, 7157), (1, 7157), (2, 7157), (3, 7157), (4, 7157), (5, 7157), (6, 7157), (7, 7157), (8, 7157), (9, 7157), (10, 7157), (11, 7157), (12, 7157), (13, 7157)]
Final shapes after preprocessing:
Input shape X: (100198, 12)
Target Shape Y: (100198, 1)
Model: est = GradientBoostingClassifier(verbose=3, n_estimators=n_est, learning_rate=0.001, max_depth=24, min_samples_leaf=3, max_features=3)
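For context, here is a minimal sketch of this pipeline; `X_raw`, `y_raw`, the 80/20 split and the `random_state` values are placeholders/assumptions, not the original code (`n_est` is left unspecified, as above):

```python
# A minimal sketch of the pipeline described above -- NOT the original code.
# X_raw (20744 x 12 features), y_raw (raw class labels) and n_est are placeholders.
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from imblearn.over_sampling import RandomOverSampler

y = LabelEncoder().fit_transform(y_raw)   # 14 integer classes, 0..13
X = MinMaxScaler().fit_transform(X_raw)   # scale the features to [0, 1]
# (the question also min-max scales the target; that step is omitted here)

# Oversample every class up to the majority-class count (7157 samples each)
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)

X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, stratify=y_res, random_state=0)

est = GradientBoostingClassifier(verbose=3, n_estimators=n_est, learning_rate=0.001,
                                 max_depth=24, min_samples_leaf=3, max_features=3)
est.fit(X_train, y_train)
```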
Outputs:
ACC: 0.9632
Feature importance:
[0.09169515 0.01167983 0. 0. 0.11126567 0.14089752
0.12381927 0.10735138 0.1344401 0.13874134 0.08111774 0.058992 ]
Accuracy score (number of correctly classified samples) on Test data: 19303. Confusion matrix on Test data:
[[1406 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ 19 1024 4 32 4 5 24 5 0 0 24 8 48 211]
[ 0 0 1434 0 0 0 0 0 0 0 0 0 0 0]
[ 1 8 0 1423 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 1441 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 1430 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 1439 0 0 0 3 0 0 1]
[ 0 0 0 0 0 0 0 1453 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 1432 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 1445 0 0 0 0]
[ 0 2 0 0 0 0 0 0 0 0 1398 0 0 1]
[ 0 0 0 0 0 0 0 0 0 0 0 1411 0 0]
[ 0 5 0 1 0 0 0 0 0 0 0 0 1413 6]
[ 1 154 9 22 12 6 22 6 3 8 17 20 45 1154]]
Precision on Test data: 0.9632235528942116
**The problem I see is when I predict on the train data: I expect 100% accuracy, but somehow my dominant classes are not predicted 100% correctly. Any reason?**
ACC: 0.9982
Accuracy score (number of correctly classified samples) on Train data: 80016. Confusion matrix on Train data:
[[5751 0 0 0 0 0 0 0 0 0 0 0 0 0]
[ **0 5699 2 2 1 0 1 3 3 2 0 2 2 32**]
[ 0 0 5723 0 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 5725 0 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 5716 0 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 5727 0 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 5714 0 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 5704 0 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 5725 0 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 5712 0 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 5756 0 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 5746 0 0]
[ 0 0 0 0 0 0 0 0 0 0 0 1 5731 0]
[ **0 4 5 5 5 2 9 8 2 16 6 19 10 5587**]]
Precision on Train data: 0.9982284987150378, Recall on Train data: 0.9982284987150378
Any idea what's going wrong?
Upvotes: -1
Views: 815
Reputation: 176
Firstly, you should NOT apply min-max scaling, or any standardisation for that matter, to the multi-class label column. Apply standardisation to the feature matrix only. In a classification problem, the label has to be treated as a discrete, categorical entity (even encoding the label classes into ordinal numbers is optional, at least in sklearn). A minimal sketch of this is shown below.
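A sketch of that point, assuming a pandas DataFrame `df` with a label column named `target` (both names are placeholders):

```python
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

X = df.drop(columns=["target"])   # feature matrix only
y = df["target"]                  # label column, kept categorical

X_scaled = MinMaxScaler().fit_transform(X)   # scale the features
y_encoded = LabelEncoder().fit_transform(y)  # optional in sklearn; string labels also work

# Do NOT do this -- the label is not a continuous quantity:
# y_scaled = MinMaxScaler().fit_transform(y.to_numpy().reshape(-1, 1))
```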
Secondly, why do you expect 100% classification accuracy on the training set? Are you implying that training accuracy should always be 100%, or is there something special about your model that makes you expect it? A well-generalised model is one where the difference between training and test accuracy is very small, if any. Ideally, of course, both training and test accuracy would be close to 100%, but that is extremely rare. 100% accuracy on the training set alone is not a measure of a good model; compare the two scores directly, as sketched below.
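A quick way to check the gap, assuming `est`, `X_train`/`y_train` and `X_test`/`y_test` from the pipeline in the question:

```python
from sklearn.metrics import accuracy_score

train_acc = accuracy_score(y_train, est.predict(X_train))
test_acc = accuracy_score(y_test, est.predict(X_test))

print(f"Train accuracy: {train_acc:.4f}")   # ~0.998 in the question
print(f"Test accuracy:  {test_acc:.4f}")    # ~0.963 in the question
print(f"Gap:            {train_acc - test_acc:.4f}")

# A small gap suggests the model generalises reasonably well; a large gap
# would indicate overfitting. Neither score is expected to be exactly 1.0.
```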
Upvotes: 2