Reputation: 13509
We have some ML models running in Azure on top of the Azure ML Studio platform (the initial drag & drop system). All has been good for over a year, but we need to move on so we can scale better, so I'm rewriting these in Python using scikit-learn and testing them in a Jupyter notebook.
The good news/bad news is that our training data is fairly small (several hundred records in a database). It's very imperfect data that produces very imperfect regression predictions, so error is to be expected, and that's fine. For this question it's actually the point: when I test these models, the predictions are far too perfect. I don't understand what I'm doing wrong, but I'm clearly doing something wrong.
The obvious things to suspect (in my mind) are that either I'm training on the test data or there's an obvious/perfect causation being found via the correlations. My use of train_test_split tells me I'm not training on my test data, and I can guarantee the second is false because of how messy this space is (we started doing manual linear regression on this data about 15 years ago, and we still maintain Excel spreadsheets so we can do it by hand in a pinch, even though that's significantly less accurate than our Azure ML Studio models).
Let's look at the code. Here's the relevant portion of my Jupyter notebook (sorry if there's a better way to format this):
X = myData
y = myData.ValueToPredict
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
train_size = 0.75,
test_size = 0.25)
print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test: ", X_test.shape)
print("y_test: ", y_test.shape)
X_train: (300, 17)
y_train: (300,)
X_test: (101, 17)
y_test: (101,)
ESTIMATORS = {
"Extra Trees": ExtraTreesRegressor(criterion = "mse",
n_estimators=10,
max_features=16,
random_state=42),
"Decision Tree": DecisionTreeRegressor(criterion = "mse",
splitter = "best",
random_state=42),
"Random Forest": RandomForestRegressor(criterion = "mse",
random_state=42),
"Linear regression": LinearRegression(),
"Ridge": RidgeCV(),
}
y_test_predict = dict()
y_test_rmse = dict()
for name, estimator in ESTIMATORS.items():
estimator.fit(X_train, y_train)
y_test_predict[name] = estimator.predict(X_test)
y_test_rmse[name] = np.sqrt(np.mean((y_test - y_test_predict[name]) ** 2)) # I think this might be wrong but isn't the source of my problem
for name, error in y_test_rmse.items():
print(name + " RMSE: " + str(error))
Extra Trees RMSE: 0.3843540838630157
Decision Tree RMSE: 0.32838969545222946
Random Forest RMSE: 0.4304701784728594
Linear regression RMSE: 7.971345895791494e-15
Ridge RMSE: 0.0001390197344951183
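Just to rule out a bug in that hand-rolled RMSE line, here's a quick cross-check against scikit-learn's own metric (a minimal sketch; it assumes the y_test_predict dict populated above and numpy imported as np):

from sklearn.metrics import mean_squared_error

# RMSE computed from sklearn's MSE; should print the same numbers as above
for name, y_pred in y_test_predict.items():
    print(name, "RMSE (sklearn):", np.sqrt(mean_squared_error(y_test, y_pred)))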
y_test_score = dict()
for name, estimator in ESTIMATORS.items():
estimator.fit(X_train, y_train)
y_test_predict[name] = estimator.predict(X_test)
y_test_score[name] = estimator.score(X_test, y_test)
for name, error in y_test_score.items():
print(name + " Score: " + str(error))
Extra Trees Score: 0.9990166492769291
Decision Tree Score: 0.999282165241745
Random Forest Score: 0.998766521504593
Linear regression Score: 1.0
Ridge Score: 0.9999999998713534
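(For what it's worth, score() on a regressor is just R², so the same numbers fall out of r2_score directly; a minimal sketch assuming sklearn.metrics is available:)

from sklearn.metrics import r2_score

# score(X_test, y_test) on a regressor returns R^2 of its predictions,
# so this should reproduce the values printed above
for name, y_pred in y_test_predict.items():
    print(name, "R^2:", r2_score(y_test, y_pred))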
I thought maybe I was doing the error metrics wrong, so I also looked at simple scores (which is why I included both). However, both show that these predictions are too good to be true. Keep in mind that the volume of inputs is small (~400 records in total). And the data this runs over is essentially predicting consumption of a commodity based on weather patterns, which is a messy space to begin with, so lots of error should be present.
What am I doing wrong here?
(Also, if I can ask this question in a better way or provide more useful information, I'd greatly appreciate it!)
Here is a heatmap of the data. I indicated the value we're predicting.
I also plotted a couple of those more important inputs vs the value we're predicting (color-coded by yet another dimension):
Here's column #2, as asked about in comments
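(For reference, the heatmap is essentially the pairwise correlation matrix of the frame; a minimal sketch of how such a heatmap can be produced, assuming seaborn and matplotlib are installed:)

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations across all columns, including ValueToPredict;
# the target's row/column shows how strongly each input tracks it.
plt.figure(figsize=(10, 8))
sns.heatmap(myData.corr(), cmap="coolwarm", center=0)
plt.show()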
As pointed out by @jwil, I wasn't pulling my ValueToPredict column out of my X variable. The solution was a one-liner added to remove that column:
X = myData
y = myData.ValueToPredict
X = X.drop("ValueToPredict", axis=1) # <--- ONE-LINE FIX!
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
train_size = 0.75,
test_size = 0.25)
With this in place, my error & scores are much more where I expect:
Extra Trees RMSE: 1.6170428819849574
Decision Tree RMSE: 1.990459810552763
Random Forest RMSE: 1.699801032532343
Linear regression RMSE: 2.5265108241534397
Ridge RMSE: 2.528721533965162
Extra Trees Score: 0.9825944193611161
Decision Tree Score: 0.9736274412836977
Random Forest Score: 0.9807672396970707
Linear regression Score: 0.9575098985510281
Ridge Score: 0.9574355079097321
Upvotes: 2
Views: 2238
Reputation: 553
You're right; I strongly suspect that you have one or more features in your X data that are nearly perfectly correlated with the Y data. Usually this is bad, because those variables don't explain Y but are either explained by Y or jointly determined with Y. To troubleshoot this, consider performing a linear regression of Y on X and then using simple p-values or AIC/BIC to determine which X variables are the least relevant. Drop those and repeat the process until your R^2 begins to drop seriously (though it will drop a little each time). The remaining variables will be the most relevant for prediction, and hopefully you'll be able to identify from that subset which variables are so tightly correlated with Y.
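A rough sketch of that backward-elimination loop, using p-values and assuming statsmodels is installed and X/y are the pandas objects from the question, might look like this:

import statsmodels.api as sm

def backward_eliminate(X, y, p_threshold=0.05):
    # Drop the least significant feature one at a time until every
    # remaining feature's p-value is below the threshold.
    features = list(X.columns)
    while len(features) > 1:
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= p_threshold:
            break
        print("Dropping", worst, "p =", round(pvals[worst], 3),
              "R^2 =", round(model.rsquared, 3))
        features.remove(worst)
    return features

kept = backward_eliminate(X, y)
print("Remaining features:", kept)

Watching how R^2 changes as features drop out should make any suspiciously perfect predictor stand out quickly.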
Upvotes: 1