Micha Schopman

Reputation: 221

Identifying accuracy and dropped features with AutoML (ml.net)

I have been playing with ML.NET AutoML and having a blast with it. I still have some questions, though, and hope someone can help or point me in the right direction.

Question 1: I have a trained binary classification model from AutoML. The experiment produced a top-5 list of algorithms ranked by accuracy, and I ended up with an SdcaLogisticRegressionBinary model with an accuracy of 89%.

Now, when I run my own evaluation, the accuracy drops to 84%. Would this mean the original training model was overfitted by 5%? Would it be fair to say that the accuracy of my model is not 89% but actually 84%, based on the evaluation?
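
For context, here is a minimal sketch of the kind of setup I mean (the file name, label column, and time budget are placeholders, not my actual values): the 89% comes from the metric AutoML reports for its best run, and the 84% comes from evaluating that model on a split I held out myself.

    using System;
    using Microsoft.ML;
    using Microsoft.ML.AutoML;

    var mlContext = new MLContext(seed: 0);

    // "data.csv" and "Label" are placeholders for my actual dataset.
    var columnInference = mlContext.Auto().InferColumns(
        "data.csv", labelColumnName: "Label", groupColumns: false);
    var loader = mlContext.Data.CreateTextLoader(columnInference.TextLoaderOptions);
    var data = loader.Load("data.csv");

    // Hold out 20% so the winning model can be evaluated afterwards.
    var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.2);

    var experiment = mlContext.Auto()
        .CreateBinaryClassificationExperiment(maxExperimentTimeInSeconds: 300);
    var result = experiment.Execute(split.TrainSet, labelColumnName: "Label");
    var bestModel = result.BestRun.Model;

    // The accuracy AutoML reports during the search (its own validation data):
    Console.WriteLine($"AutoML accuracy:   {result.BestRun.ValidationMetrics.Accuracy:P2}");

    // The accuracy of the best model on my own held-out split:
    var predictions = bestModel.Transform(split.TestSet);
    var metrics = mlContext.BinaryClassification.Evaluate(predictions, labelColumnName: "Label");
    Console.WriteLine($"Held-out accuracy: {metrics.Accuracy:P2}");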

Question 2: AutoML also drops features during training where needed. Is there a way to retrieve the actual list of features that were included in the final model, i.e. to determine which features were dropped because they didn't improve the model's accuracy?

When I inspect the final model, its OutputSchema always seems to include all the features from the initial training data.
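
For illustration, this is the kind of inspection I have been attempting: a sketch that walks the transformer chain of the best model and dumps the slot names of the "Features" vector (assuming the conventional "Features" column name; mlContext, bestModel, and data refer to the objects from the sketch above).

    using System;
    using Microsoft.ML;
    using Microsoft.ML.Data;

    // Walk the transforms AutoML chained together in front of the trainer:
    if (bestModel is TransformerChain<ITransformer> chain)
    {
        foreach (var transformer in chain)
            Console.WriteLine(transformer.GetType().Name);
    }

    // Dump the slot names packed into the "Features" vector, i.e. the columns
    // that were actually fed to the learner (this throws if the column carries
    // no slot-name annotations):
    var transformed = bestModel.Transform(data);
    VBuffer<ReadOnlyMemory<char>> slotNames = default;
    transformed.Schema["Features"].GetSlotNames(ref slotNames);
    foreach (var name in slotNames.DenseValues())
        Console.WriteLine(name.ToString());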

Upvotes: 0

Views: 345

Answers (1)

desertnaut

Reputation: 60370

Would this mean the original training model was overfitted by 5%?

This terminology is meaningless, and it is never used. Sadly, "overfitting" is a much-abused term nowadays, used to mean almost anything linked to suboptimal performance; nevertheless, practically speaking, overfitting means something very specific: its telltale signature is your validation loss starting to increase while your training loss continues to decrease, i.e.:

[Figure: typical learning curves, with training loss continuing to decrease while validation loss starts to increase]

The 5% "margin" between your training and validation accuracy (89% − 84% = 5%) is another story altogether; it is called the generalization gap, and it does not by itself signify overfitting.

Would it be fair to say that the accuracy of my model is not 89% but actually 84% based on the evaluation?

As you have probably already suspected, "accuracy" by itself is an ambiguous term; the truth is that, in practice, when used without any other qualifier, it is usually taken to mean the validation accuracy (practically nobody bothers about the exact value of the training accuracy). In any case, the correct report of your results would be: training accuracy 89%, validation accuracy 84%.
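
In ML.NET terms, reporting both numbers is just a matter of evaluating the same trained model on both splits; a minimal sketch, assuming the mlContext, bestModel, and split objects from your question (if the winning model does not produce a Probability column, use EvaluateNonCalibrated instead of Evaluate):

    var trainMetrics = mlContext.BinaryClassification.Evaluate(
        bestModel.Transform(split.TrainSet), labelColumnName: "Label");
    var validMetrics = mlContext.BinaryClassification.Evaluate(
        bestModel.Transform(split.TestSet), labelColumnName: "Label");

    // Both sides of the generalization gap, reported explicitly:
    Console.WriteLine($"Training accuracy:   {trainMetrics.Accuracy:P2}");
    Console.WriteLine($"Validation accuracy: {validMetrics.Accuracy:P2}");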

Upvotes: 1
