Reputation: 23
I'm working on a script that will predict the used disk space % on a server given a future date. The Use% value is captured once per day from the output of this command:
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 30G 24G 4.4G 85% /
and recorded along with the date. The script is in Python, and the short of it is that I'm getting a very low score when I use LinearRegression as my model. The code is below:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# list of tuples whose format is (day_of_month, percent_used)
results = [(1, 83), (2, 87), (3, 87), (4, 87), (5, 89), (6, 88), (7, 83), (8, 75), (9, 73), (10, 73), (11, 74), (12, 77), (13, 77), (14, 79), (15, 79), (16, 79), (17, 79), (18, 79), (19, 80), (21, 80), (22, 81), (23, 84), (24, 85), (25, 85), (26, 85), (27, 85), (28, 85)]
labels = ['day', 'pct_used']
df = pd.DataFrame.from_records(results, columns=labels)

# convert the day column into a 2-D array of shape (n_samples, 1)
X = np.array(df['day']).reshape(-1, 1)
y = np.array(df['pct_used'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

clf = LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))

# predict day 30's pct_used value
print(clf.predict(np.array([[30]])))
and it outputs:
-0.19521578836110454
[81.22057369]
where the clf.score is negative each time. I would like it to be positive, and at least 0.95, so I can be confident in the prediction. I'm not sure if I'm using the wrong model, need more data, need more features, or am doing something else wrong in the code.
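From what I've read, clf.score here is the coefficient of determination R², which goes negative whenever the model does worse on the test set than simply predicting the mean of the test targets. A quick check with toy numbers (using sklearn.metrics.r2_score, which is what LinearRegression.score computes):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([83.0, 87.0, 75.0, 80.0])

# Predicting the mean of y_true gives R^2 == 0 exactly
mean_pred = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, mean_pred))   # 0.0

# Predictions worse than the mean give a negative R^2
bad_pred = np.array([90.0, 70.0, 88.0, 72.0])
print(r2_score(y_true, bad_pred))
```

So a negative score just means the fitted line predicts the held-out points worse than a horizontal line at the mean would.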
Something interesting I've found is if I change the initial list of results to have more linearly increasing pct_used, eg:
results = [(1, 73), (2, 73), (3, 74), (4, 75), (5, 76), (6, 77), (7, 78), (8, 78), (9, 80), (10, 80), (11, 81), (12, 82), (13, 83), (14, 84), (15, 85), (16, 85), (17, 85), (18, 86), (19, 86), (21, 87), (22, 88), (23, 89), (24, 89), (25, 90), (26, 91), (27, 91), (28, 92)]
Then the score skyrockets with this output:
0.9852576797564747
[94.37028796]
So that makes me think that LinearRegression works well only as long as the y-axis data is mostly linear. Of course, in the real world disk space fluctuates like it does in my original dataset, so that's why I'm thinking maybe I should use a different model, but I tried sklearn.svm.SVR() and it scored very poorly as well.
I suppose that instead of linear regression, a logistic regression approach could work, where either the disk is likely to exceed 90% used in the next few days or it's not. I've also read briefly about 'time-series forecasting', though I'm not sure whether this problem fits that category (I'm new to machine learning). I'm flexible; I'm really just questioning what is wrong with my setup and whether I need to take a new approach altogether.
Thank you for any suggestions and specific edits to improve the code.
Upvotes: 1
Views: 2133
Reputation: 1704
When you use linear regression, you are just fitting a line to the data. If the data is not linear, it is not a great method. Notice that your data is not linear with respect to day:
Fitting a line (i.e. doing LinearRegression) over your data gives you a line that is not a great predictor of your data:
There are however subsets of your data that are more linear. For example, if you use a linear regression from day 8 on, then you get the following line:
Your "score" goes way up. Running your code 1,000 times on this subset of the data gives an average score of about 0.876. You may also want to come up with a model that accounts for the fact that, past a certain percentage used, a user will probably delete files in order to free up space.
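A rough sketch of that subset fit, using the day-8-onward slice of the data from the question (the 0.876 average above came from repeated random train/test splits; fitting the whole slice gives a deterministic in-sample score):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# day-8-onward slice of the (day_of_month, percent_used) data
results = [(8, 75), (9, 73), (10, 73), (11, 74), (12, 77), (13, 77),
           (14, 79), (15, 79), (16, 79), (17, 79), (18, 79), (19, 80),
           (21, 80), (22, 81), (23, 84), (24, 85), (25, 85), (26, 85),
           (27, 85), (28, 85)]

X = np.array([day for day, _ in results]).reshape(-1, 1)
y = np.array([pct for _, pct in results])

clf = LinearRegression()
clf.fit(X, y)

# in-sample R^2 on the day-8-onward slice -- roughly 0.92
print(clf.score(X, y))
# extrapolated Use% for day 30
print(clf.predict(np.array([[30]])))
```

This is only a sketch: it scores on the same points it was fit to, so it is an optimistic estimate compared with the held-out splits above, but it shows how much better a line fits once the early non-linear days are dropped.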
Upvotes: 6