OverthinkingLab

Reputation: 23

Python scikit-learn: Why is my LinearRegression classifier's score so low?

I'm working on a script that will predict the used disk space % on a server given a future date. The Use% value is captured once a day from this command:

$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda3              30G   24G  4.4G  85% /

and recorded along with the date. The script is in Python, and the short of it is I'm getting a very low score when I use LinearRegression as my classifier. The code is below:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# list of tuples whose format is (day_of_month, percent_used)
results = [(1, 83), (2, 87), (3, 87), (4, 87), (5, 89), (6, 88), (7, 83), (8, 75), (9, 73), (10, 73), (11, 74), (12, 77), (13, 77), (14, 79), (15, 79), (16, 79), (17, 79), (18, 79), (19, 80), (21, 80), (22, 81), (23, 84), (24, 85), (25, 85), (26, 85), (27, 85), (28, 85)]

labels = ['day', 'pct_used']
df = pd.DataFrame.from_records(results, columns=labels)

# convert the day column into the 2-D (n_samples, 1) array sklearn expects
X = np.array(df['day']).reshape(-1, 1)

y = np.array(df['pct_used'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
clf = LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
# predict day 30's pct_used value
print(clf.predict(np.array([30]).reshape(-1, 1)))

and it outputs:

-0.19521578836110454
[81.22057369]

where the clf.score is negative each time. I would like to get it positive, at least .95 or higher, so I can be confident in the prediction. I'm not sure if I'm using the wrong classifier, need more data, need more features, or am doing something else wrong in the code.

Something interesting I've found is that if I change the initial list of results to a more linearly increasing pct_used, e.g.:

results = [(1, 73), (2, 73), (3, 74), (4, 75), (5, 76), (6, 77), (7, 78), (8, 78), (9, 80), (10, 80), (11, 81), (12, 82), (13, 83), (14, 84), (15, 85), (16, 85), (17, 85), (18, 86), (19, 86), (21, 87), (22, 88), (23, 89), (24, 89), (25, 90), (26, 91), (27, 91), (28, 92)]

Then the score skyrockets with this output:

0.9852576797564747
[94.37028796]

So that makes me think LinearRegression works well as the classifier only as long as the Y-axis data is mostly linear. Of course, in the real world disk space fluctuates like it does in my original dataset, so that's why I'm thinking maybe I should be using a different classifier, but I tried sklearn.svm.SVR() and that had a very poor score as well.

I suppose instead of linear regression a logistic regression approach could work, where either it's likely to exceed 90% used in the next few days or it's not. Or I read briefly about 'time-series forecasting', though I'm not sure whether this meets the criteria or not (I'm new to machine learning). I'm flexible, just really questioning what is wrong with my setup and whether I need to take a new approach altogether.
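For the logistic route, I imagine it would be something like the sketch below (the 85% cutoff is just an arbitrary choice I made for the "almost full" label, since my actual data never reaches 90%):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

results = [(1, 83), (2, 87), (3, 87), (4, 87), (5, 89), (6, 88), (7, 83), (8, 75), (9, 73), (10, 73), (11, 74), (12, 77), (13, 77), (14, 79), (15, 79), (16, 79), (17, 79), (18, 79), (19, 80), (21, 80), (22, 81), (23, 84), (24, 85), (25, 85), (26, 85), (27, 85), (28, 85)]

days = np.array([day for day, _ in results]).reshape(-1, 1)
# binary label: 1 if usage is at or above the 85% cutoff, else 0
full = np.array([1 if pct >= 85 else 0 for _, pct in results])

clf = LogisticRegression()
clf.fit(days, full)
# probability that day 30 is at or above the cutoff
print(clf.predict_proba([[30]])[0][1])
```

But I don't know if reframing the problem this way is the right call.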

Thank you for any suggestions and specific edits to improve the code.

Upvotes: 1

Views: 2133

Answers (1)

Joe Patten

Reputation: 1704

When you use a linear regression, you are actually just fitting a line to the data. If the data is not linear, then it is not a great method. Notice that your data is not linear with respect to day:

[scatter plot: pct_used vs. day for the original data]

Fitting a line (i.e. doing LinearRegression) over your data gives you a line that is not a great predictor of your data:

[plot: the fitted regression line over the full dataset]

There are, however, subsets of your data that are more linear. For example, if you run a linear regression from day 8 on, you get the following line:

[plot: regression line fit to the data from day 8 onward]

Your "score" goes way up. Running your code 1000 times on this subset of the data gives an average score of .875857. You may want to come up with a model that accounts for the fact that at a certain percentage, a user will probably delete files in order to free up space.
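As a rough sketch of the subset idea (your score will vary with the random train/test split, so here I just score on the whole subset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

results = [(1, 83), (2, 87), (3, 87), (4, 87), (5, 89), (6, 88), (7, 83), (8, 75), (9, 73), (10, 73), (11, 74), (12, 77), (13, 77), (14, 79), (15, 79), (16, 79), (17, 79), (18, 79), (19, 80), (21, 80), (22, 81), (23, 84), (24, 85), (25, 85), (26, 85), (27, 85), (28, 85)]

# keep only the roughly linear tail of the data
subset = [(day, pct) for day, pct in results if day >= 8]

X = np.array([day for day, _ in subset]).reshape(-1, 1)
y = np.array([pct for _, pct in subset])

clf = LinearRegression().fit(X, y)
print(clf.score(X, y))      # R^2 on the day >= 8 subset
print(clf.predict([[30]]))  # extrapolated pct_used for day 30
```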

Upvotes: 6

Related Questions