Moveton

Reputation: 255

Python - machine learning

Currently I am trying to understand how machine learning algorithms work, and one thing I don't really get is the obvious difference between the calculated accuracy of the predicted labels and the visual confusion matrix. I will try to explain this as clearly as possible.

Here is a snippet of the dataset (9 samples shown here, about 4k in the real dataset; 6 features and 9 labels, which stand not for numbers but for categories, and cannot be ordered like 7 > 4 > 1):

f1      f2      f3      f4      f5    f6   label
89.18   0.412   9.1     24.17   2.4   1    1
90.1    0.519   14.3    16.555  3.2   1    2
83.42   0.537   13.3    14.93   3.4   1    3
64.82   0.68    9.1     8.97    4.5   2    4
34.53   0.703   4.9     8.22    3.5   2    5
87.19   1.045   4.7     5.32    5.4   2    6
43.23   0.699   14.9    12.375  4.0   2    7
43.29   0.702   7.3     6.705   4.0   2    8
20.498  1.505   1.321   6.4785  3.8   2    9

Out of curiosity I tried a number of algorithms (linear, Gaussian, SVM (SVC, SVR), Bayesian, etc.). As far as I understood the manual, in my case it is better to work with classifiers (discrete) rather than regression (continuous). Using the common:

model.fit(X_train, y_train) 
model.score(X_test, y_test)

I got:

Lin_Reg: 0.855793988736
Log_Reg: 0.463251670379
DTC:     0.400890868597
KNC:     0.41425389755
LDA:     0.550111358575
Gaus_NB: 0.391982182628
Bay_Rid: 0.855698151574
SVC:     0.483296213808
SVR:     0.647914795849

The continuous algorithms gave better results. When I used a confusion matrix for Bayesian Ridge (I had to convert the float predictions to integers) to verify its result, I got the following:

Pred  l1   l2   l3   l4   l5   l6   l7   l8   l9
True
l1    23   66    0    0    0    0    0    0    0
l2    31   57    1    0    0    0    0    0    0
l3    13   85   19    0    0    0    0    0    0
l4     0    0    0    0    1    6    0    0    0
l5     0    0    0    4    8    7    0    0    0
l6     0    0    0    1   27   36    7    0    0
l7     0    0    0    0    2   15    0    0    0
l8     0    0    0    1    1   30    8    0    0
l9     0    0    0    1    0    9    1    0    0

This made me realize that the 85% accuracy figure must be wrong. How can this be explained? Is it because of the float/int conversion?
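As a quick sanity check, the true classification accuracy can be read off the confusion matrix itself: correct predictions are the diagonal entries, and accuracy is their sum divided by the total. A minimal sketch using the matrix from the question:

```python
import numpy as np

# Confusion matrix from the question (rows = true labels, columns = predicted)
cm = np.array([
    [23, 66,  0,  0,  0,  0, 0, 0, 0],
    [31, 57,  1,  0,  0,  0, 0, 0, 0],
    [13, 85, 19,  0,  0,  0, 0, 0, 0],
    [ 0,  0,  0,  0,  1,  6, 0, 0, 0],
    [ 0,  0,  0,  4,  8,  7, 0, 0, 0],
    [ 0,  0,  0,  1, 27, 36, 7, 0, 0],
    [ 0,  0,  0,  0,  2, 15, 0, 0, 0],
    [ 0,  0,  0,  1,  1, 30, 8, 0, 0],
    [ 0,  0,  0,  1,  0,  9, 1, 0, 0],
])

# Accuracy = correctly classified samples (diagonal) / all samples
accuracy = np.trace(cm) / cm.sum()
print(round(accuracy, 3))  # -> 0.311
```

So the matrix corresponds to roughly 31% accuracy, nowhere near 85% — which already hints that the 0.855 score was not measuring accuracy at all.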

I would be thankful for any direct answer, link, etc.

Upvotes: 7

Views: 602

Answers (3)

Vincent J. Michuki

Reputation: 549

Take a look at this.
Use "model.score(X_test, y_test)".

Upvotes: 1

Lukasz Tracewski

Reputation: 11377

You are mixing two very distinct concepts of machine learning here: regression and classification. Regression typically deals with continuous values, e.g. temperature or a stock market value. Classification, on the other hand, can tell you which bird species is in a recording — that's exactly where you would use a confusion matrix: it tells you how many times the algorithm predicted the label correctly and where it made mistakes. scikit-learn, which you are using, has separate sections for both.

Classification and regression problems use different metrics for scoring, so never assume their scores are comparable. As @javad pointed out, the 'coefficient of determination' is very different from accuracy. I would also recommend reading up on precision and recall.

In your case you clearly have a classification problem, and it should be treated as such. Also, note that f6 looks like it has a discrete set of values.
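To illustrate treating this as a classification problem, here is a hedged sketch on synthetic stand-in data (the real dataset isn't available; the feature/label shapes merely mimic the question's 6 features and 9 classes). For any scikit-learn classifier, score() and accuracy_score agree, and the confusion matrix is built from the discrete predictions directly — no float-to-int conversion needed:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Synthetic stand-in for the questioner's data: 6 features, labels 1..9
rng = np.random.RandomState(0)
X = rng.rand(400, 6)
y = rng.randint(1, 10, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# For a classifier, score() IS the accuracy, consistent with the matrix
print(clf.score(X_test, y_test) == accuracy_score(y_test, y_pred))  # True
print(confusion_matrix(y_test, y_pred))
```

RandomForestClassifier is just one reasonable choice for a small tabular multi-class problem; any classifier from the question's list (SVC, KNC, DTC, ...) would slot in the same way.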

If you'd like to quickly experiment with different approaches, I can recommend e.g. H2O, which, alongside a nice API, has a great user interface and allows for massively parallel processing. XGBoost is also excellent.

Upvotes: 5

javad

Reputation: 528

Take a look at the documentation here.

If you call score() on regression methods, they will return the 'coefficient of determination R^2 of the prediction', not the accuracy.
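This difference is easy to verify directly. A small sketch on synthetic data (the arrays here are made up for illustration) showing that a regressor's score() matches r2_score while a classifier's score() matches accuracy_score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, accuracy_score

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
# Continuous target for regression, thresholded copy for classification
y_cont = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.1, size=100)
y_cls = (y_cont > y_cont.mean()).astype(int)

reg = LinearRegression().fit(X, y_cont)
# Regressor .score() is the R^2 coefficient of determination
print(np.isclose(reg.score(X, y_cont), r2_score(y_cont, reg.predict(X))))

clf = LogisticRegression().fit(X, y_cls)
# Classifier .score() is mean accuracy
print(np.isclose(clf.score(X, y_cls), accuracy_score(y_cls, clf.predict(X))))
```

Both lines print True, which is why a 0.855 from Bayesian Ridge (an R^2) cannot be compared against the 0.46 from logistic regression (an accuracy).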

Upvotes: 4
