Reputation: 385
Training this model in a for loop of 200K iterations, I got a precision of 0.97 (I guess this means 97%?), and I saved the model in a .pickle file. The problem is that it doesn't look like it is learning, because I get the same results even without training the model, with a precision of 70-90%. If I got a higher precision I would think that it is learning but, as I said, the result is not changing.
Anyway, even with a precision of 70-97%, it only gives the correct result for ~20-45% of all data. As you can see, I'm new to this, and I'm following a tutorial at: https://www.youtube.com/watch?v=3AQ_74xrch8
Here is the code:
import pandas as pd
import numpy as np
import pickle
import sklearn.model_selection  # import the submodule explicitly; it is used below
from sklearn import linear_model
data = pd.read_csv('student-mat.csv', sep=';')
data = data[['G1', 'G2', 'G3', 'studytime', 'failures', 'absences']]
predict = 'G3'
X = np.array(data.drop([predict], axis=1))  # pass axis by keyword; recent pandas versions reject the positional form
y = np.array(data[predict])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
# comment after train the model #
best_accuracy = 0
array_best_accurary = []
for _ in range(200000):
    x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1)
    linear = linear_model.LinearRegression()
    linear.fit(x_train, y_train)
    accuracy = linear.score(x_test, y_test)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        array_best_accurary.append(best_accuracy)
        with open('student_model.pickle', 'wb') as f:
            pickle.dump(linear, f)
print(max(array_best_accurary), '\n')
# #
# uncomment after train the model
# picke_in = open('student_model.pickle', 'rb')
# linear = pickle.load(picke_in)
print('Coeficient:\n', linear.coef_)
print('Intercept:\n', linear.intercept_, '\n')
predictions = linear.predict(x_test)
total = len(predictions)
correct_predictions = []
for x in range(total):
    print('Predict', predictions[x], '- Correct', y_test[x])
    if int(predictions[x]) == y_test[x]:
        correct_predictions.append(1)
print('\n')
print('Total:', total)
print('Total correct predicts:', len(correct_predictions))
And the output:
0.977506233512022
Coeficient:
[ 0.14553549 0.98120042 -0.18857019 -0.31539844 0.03324807]
Intercept:
-1.3929098924365348
Predict 9.339230104273398 - Correct 9
Predict -1.7999979510132014 - Correct 0
Predict 18.220125096856393 - Correct 18
Predict 3.5669380684894634 - Correct 0
Predict 8.394034346453692 - Correct 10
Predict 11.17472103817094 - Correct 12
Predict 6.877027043616517 - Correct 7
Predict 13.10046638328761 - Correct 14
Predict 8.460530481589299 - Correct 9
Predict 5.619296478409708 - Correct 9
Predict 5.056861318329287 - Correct 6
Predict -0.4602308511632893 - Correct 0
Predict 5.4907111970972124 - Correct 7
Predict 7.098301508597935 - Correct 0
Predict 9.060702343692888 - Correct 11
Predict 14.906413508421672 - Correct 16
Predict 5.337146104521532 - Correct 7
Predict 6.451206767114973 - Correct 6
Predict 12.005846951225159 - Correct 14
Predict 9.181910373164804 - Correct 0
Predict 7.078728252841696 - Correct 8
Predict 12.944012673326714 - Correct 13
Predict 9.296195408827478 - Correct 10
Predict 9.726422674287734 - Correct 10
Predict 5.872952989811228 - Correct 6
Predict 11.714775970606564 - Correct 12
Predict 10.699461464343582 - Correct 11
Predict 8.079501926145412 - Correct 8
Predict 17.050354493553698 - Correct 17
Predict 11.950269035741151 - Correct 12
Predict 11.907234340295231 - Correct 12
Predict 8.394034346453692 - Correct 8
Predict 9.563804949756388 - Correct 10
Predict 15.08795365845874 - Correct 15
Predict 15.197484489040267 - Correct 14
Predict 9.339230104273398 - Correct 10
Predict 6.72710996076076 - Correct 8
Predict 15.778083095387622 - Correct 16
Predict 8.238497037369088 - Correct 9
Predict 11.357208854852361 - Correct 12
Total: 40
Total correct predicts: 8
I know that the prediction is a float, but even if I round it up or down, I still don't get the expected result. I know that my code is too simple, but even if I also count a prediction that is == (desired prediction - 1) as correct, the output above would give me 27 correct predictions, which is ~60% of the total. Isn't that too low? I would expect something like 70-80%.
My main doubt is why I'm getting ~20-45% correct results even though the precision is 70-97%. Maybe I misunderstood how it works; could someone clarify?
The dataset i'm using: https://archive.ics.uci.edu/ml/datasets/Student+Performance
Upvotes: -1
Views: 1717
Reputation: 60321
There are several issues with your question.
To start with, in regression settings (such as yours here) we don't use the terms "precision" and "accuracy", which are reserved for classification problems (in which they have very specific meanings and they are far from synonyms).
Having said that, your next step is to clarify for yourself what your metric is, i.e. what exactly is returned by your linear.score(x_test, y_test); here, as in many other similar settings, the documentation is your best friend:
score(self, X, y, sample_weight=None)
Returns the coefficient of determination R^2 of the prediction.
So, your metric is the coefficient of determination R^2, or R-squared.
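To make the distinction concrete, here is a minimal sketch that computes R^2 from its definition (1 minus the residual sum of squares over the total sum of squares). The helper name `r_squared` and the toy values are made up for illustration and are not taken from the question's data; every prediction is deliberately off by 1.2, so none of them matches its target after conversion to an integer.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy data: each prediction misses its target by 1.2
y_true = [0, 5, 10, 15, 20]
y_pred = [1.2, 6.2, 11.2, 16.2, 21.2]

score = r_squared(y_true, y_pred)

# Count "exact" matches the way the question does, using int()
exact_hits = sum(int(p) == t for t, p in zip(y_true, y_pred))

print("R^2:", round(score, 4))
print("Exact matches:", exact_hits, "of", len(y_true))
```

Here R^2 comes out at roughly 0.97 while the number of exact matches is zero, so a high score and a low count of "correct" predictions are not in conflict.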
Although an R^2 value of 0.97 sounds pretty good (and it sometimes can be interpreted as 97%, but this does not mean "correct predictions"), the use of the metric in predictive settings, like here, is quite problematic; quoting from my own answer in another SO thread:
the whole R-squared concept comes in fact directly from the world of statistics, where the emphasis is on interpretative models, and it has little use in machine learning contexts, where the emphasis is clearly on predictive models; at least AFAIK, and beyond some very introductory courses, I have never (I mean never...) seen a predictive modeling problem where the R-squared is used for any kind of performance assessment; neither it's an accident that popular machine learning introductions, such as Andrew Ng's Machine Learning at Coursera, do not even bother to mention it. And, as noted in the Github thread above (emphasis added):
In particular when using a test set, it's a bit unclear to me what the R^2 means.
with which I certainly concur.
So, you would be better off using one of the standard metrics for predictive regression problems, such as the Mean Squared Error (MSE) or the Mean Absolute Error (MAE); the second has the advantage of being in the same units as your dependent variable (here, grade points), while the MSE is in squared units. Since both these quantities are errors, lower is better. Have a look at the available regression metrics in scikit-learn and how to use them.
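As a rough illustration, both metrics can be written out from their definitions in a few lines. The helper names below are made up for this sketch, and the five value pairs are the first five rows of the question's output with the predictions rounded to two decimals; in practice you would more likely call `mean_absolute_error` and `mean_squared_error` from `sklearn.metrics`, which compute the same quantities.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |target - prediction|, in the units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error_manual(y_true, y_pred):
    """Average of (target - prediction)^2, in squared units of the target."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# First five rows of the question's output (predictions rounded to 2 decimals)
y_true = [9, 0, 18, 0, 10]
y_pred = [9.34, -1.80, 18.22, 3.57, 8.39]

mae = mean_absolute_error_manual(y_true, y_pred)
mse = mean_squared_error_manual(y_true, y_pred)

print("MAE:", round(mae, 3), "grade points")
print("MSE:", round(mse, 3), "grade points squared")
```

On these five rows the MAE is about 1.5, which reads directly as "off by about one and a half grade points on average"; that is usually easier to reason about than a count of exact matches.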
Last but not least, and independently of the discussion above, I cannot see how you have actually arrived at this assessment of your results:
Total: 40
Total correct predicts: 8
since, if we apply the usual rounding rules (i.e. 15.49 rounds to 15, but 15.51 rounds to 16), I see that roughly half of your predictions are indeed "correct"...
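One likely explanation for the gap is that the question's code compares `int(predictions[x])` with the target, and `int()` truncates toward zero instead of rounding to the nearest integer. The sketch below redoes the count both ways on the 40 rows printed in the question; the predictions are copied from that output and rounded to two decimals, and the variable names are made up for illustration.

```python
# Predictions from the question's output, rounded to two decimals
preds = [
    9.34, -1.80, 18.22, 3.57, 8.39, 11.17, 6.88, 13.10, 8.46, 5.62,
    5.06, -0.46, 5.49, 7.10, 9.06, 14.91, 5.34, 6.45, 12.01, 9.18,
    7.08, 12.94, 9.30, 9.73, 5.87, 11.71, 10.70, 8.08, 17.05, 11.95,
    11.91, 8.39, 9.56, 15.09, 15.20, 9.34, 6.73, 15.78, 8.24, 11.36,
]
# Corresponding "Correct" values from the same output
actual = [
    9, 0, 18, 0, 10, 12, 7, 14, 9, 9,
    6, 0, 7, 0, 11, 16, 7, 6, 14, 0,
    8, 13, 10, 10, 6, 12, 11, 8, 17, 12,
    12, 8, 10, 15, 14, 10, 8, 16, 9, 12,
]

# int() drops the fractional part, so 11.95 becomes 11
truncated_hits = sum(int(p) == a for p, a in zip(preds, actual))

# round() goes to the nearest integer, so 11.95 becomes 12
rounded_hits = sum(round(p) == a for p, a in zip(preds, actual))

print("Matches with int():  ", truncated_hits, "of", len(preds))
print("Matches with round():", rounded_hits, "of", len(preds))
```

Truncation gives 8 matches, the figure reported in the question, while rounding gives 18 of 40. Either way, counting exact matches remains a classification-style check rather than a regression metric.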
Upvotes: 2
Reputation: 2112
In regression, you don't measure accuracy by checking whether each prediction exactly matches the actual target; that method is used for classification tasks. For regression, you should evaluate your model using metrics like MSE, MAE, etc.
Upvotes: 1