user2463426

Reputation: 7

Statistical comparison of machine learning algorithms

I am working in machine learning and I am stuck on one thing.

I want to compare 4 machine learning techniques across 10 datasets. After performing the experiments I obtained Area Under Curve (AUC) values. I then applied an analysis of variance (ANOVA) test, which shows there is a significant difference between the 4 machine learning techniques.

Now my problem is: which test will conclude that a particular algorithm performs better than the other algorithms? I want only one winner among the machine learning techniques.

Upvotes: 0

Views: 528

Answers (2)

invoketheshell

Reputation: 3897

If you are gathering performance metrics (ROC AUC, accuracy, sensitivity, specificity, ...) from identically resampled datasets, then you can perform statistical tests using paired comparisons. Most statistical software implements Tukey's range test (an ANOVA post-hoc test): https://en.wikipedia.org/wiki/Tukey%27s_range_test. A formal treatment of this material is here: http://epub.ub.uni-muenchen.de/4134/1/tr030.pdf. This is the test I like to use for the purpose you describe, although there are others and people have varying opinions.
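
A minimal sketch of how that might look in Python, assuming you have one AUC value per technique per dataset (the numbers below are made up for illustration); SciPy's `tukey_hsd` runs the pairwise comparisons:

```python
# Minimal sketch: Tukey's range test on per-dataset AUC scores.
# auc_a ... auc_d hold one AUC per dataset for each of four
# hypothetical techniques (the values are illustrative, not real).
import numpy as np
from scipy.stats import tukey_hsd

auc_a = np.array([0.91, 0.88, 0.93, 0.85, 0.90, 0.87, 0.92, 0.89, 0.86, 0.94])
auc_b = np.array([0.84, 0.82, 0.88, 0.80, 0.85, 0.81, 0.86, 0.83, 0.79, 0.87])
auc_c = np.array([0.90, 0.87, 0.91, 0.84, 0.89, 0.86, 0.90, 0.88, 0.85, 0.92])
auc_d = np.array([0.78, 0.76, 0.81, 0.74, 0.79, 0.75, 0.80, 0.77, 0.73, 0.82])

# Pairwise comparisons with family-wise error control; a low p-value
# indicates a significant difference between that pair of techniques.
result = tukey_hsd(auc_a, auc_b, auc_c, auc_d)
print(result)
```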

You will still have to choose how you will resample based on your data: k-fold, repeated k-fold, bootstrap, leave-one-out, or repeated training/test splits. Bootstrap methods tend to give you the tightest confidence intervals after leave-one-out, but leave-one-out might not be an option if your data is huge.
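
For illustration, here is a sketch of one such choice, repeated stratified k-fold, collecting ROC AUC scores per model with scikit-learn so the scores come from identically resampled folds (the dataset and models are placeholders):

```python
# Sketch: repeated stratified k-fold, gathering per-fold ROC AUC scores
# for each model on the same resampled splits.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```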

That being said, you may also need to consider the problem domain. False positives may be an issue in classification, so you may need to consider other metrics to choose the best performer for the domain; AUC is not always the best metric for a specific domain. For instance, a credit card company may not want to deny transactions to legitimate customers, so it needs a very low false positive rate on fraud classification.
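
To illustrate, you could inspect the false positive rate directly from a confusion matrix rather than relying on AUC alone (the labels and predictions below are hypothetical):

```python
# Sketch: computing the false positive rate, which AUC alone can hide.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 0, 0, 1, 1, 0, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)  # fraction of legitimate cases flagged as positive
print(f"false positive rate: {fpr:.2f}")
```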

You may also want to consider implementation. If a logistic regression performs nearly as well, it may be a better choice than a more complicated implementation such as a random forest. Are there legal implications to model use (Fair Credit Reporting Act, ...)?

A common-sense approach is to begin with something like a random forest or gradient boosted trees to get an empirical sense of a performance ceiling, then build simpler models and use the simplest one that performs reasonably well compared to that ceiling.
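
A rough sketch of that workflow, with a gradient boosted model as the assumed ceiling and a logistic regression as the simpler candidate (dataset and acceptable margin are illustrative assumptions):

```python
# Sketch: estimate a performance ceiling with gradient boosted trees,
# then check how close a simpler logistic regression gets.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ceiling = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
simple = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

auc_ceiling = roc_auc_score(y_te, ceiling.predict_proba(X_te)[:, 1])
auc_simple = roc_auc_score(y_te, simple.predict_proba(X_te)[:, 1])
print(f"ceiling AUC: {auc_ceiling:.3f}, simple AUC: {auc_simple:.3f}")
# If the simple model lands within an acceptable margin of the ceiling,
# its interpretability may make it the better practical choice.
```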

Or you could combine all of your models (stacking) using something like LASSO as the meta-learner... or some other model.
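
One possible version of that, using scikit-learn's `StackingClassifier` with an L1-penalized (LASSO-style) logistic regression as the meta-learner; the base models here are arbitrary placeholders:

```python
# Sketch: stacking base models under an L1-penalized meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(random_state=0)),
        ("nb", GaussianNB()),
    ],
    # The L1 penalty shrinks weights of unhelpful base models toward zero.
    final_estimator=LogisticRegression(penalty="l1", solver="liblinear"),
)
stack.fit(X, y)
print(stack.score(X, y))
```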

Upvotes: 0

runDOSrun

Reputation: 10995

A classifier's quality can be measured by the F-score (the harmonic mean of precision and recall), which measures the test's accuracy. Comparing the respective scores will give you a simple measure.
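
A minimal sketch of such a comparison with scikit-learn's `f1_score` (the prediction arrays are made up):

```python
# Sketch: comparing two classifiers by F-score on the same test labels.
from sklearn.metrics import f1_score

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
pred_model_a = [0, 1, 1, 0, 0, 0, 1, 1]
pred_model_b = [0, 1, 0, 1, 0, 0, 1, 1]

print("model A F-score:", f1_score(y_true, pred_model_a))
print("model B F-score:", f1_score(y_true, pred_model_b))
```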

However, if you want to test whether the difference between the classifiers' accuracies is significant, you can try a Bayesian test or, if the classifiers are trained only once, McNemar's test.
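
For example, McNemar's test can be run on the 2x2 agreement table of two classifiers evaluated on the same test set; statsmodels provides an implementation (the counts below are hypothetical):

```python
# Sketch: McNemar's test for two classifiers trained once and scored
# on the same test set. The table counts agreement/disagreement.
from statsmodels.stats.contingency_tables import mcnemar

# rows: classifier A correct / wrong; columns: classifier B correct / wrong
table = [[55, 5],
         [15, 25]]

result = mcnemar(table, exact=True)
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")
```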

There are other possibilities, and the papers *On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach* (Salzberg, 1997) and *Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms* (Dietterich, 1998) are probably worth reading.

Upvotes: 1
