Reputation: 13509
We have some ML models running in Azure on top of the Azure ML Studio platform (the initial drag & drop system). All has been good for over a year, but we need to move on so we can scale better, so I'm rewriting these in Python using scikit-learn and testing them in a Jupyter notebook.
The good news/bad news is that our training data is fairly small (several hundred records in a database). It's very imperfect data that produces very imperfect regression predictions, so error is to be expected, and that's fine. For this question it's actually the point: when I test these models, the predictions are far too perfect. I don't understand what I'm doing wrong, but I'm clearly doing something wrong.
The obvious things to suspect (in my mind) are that either I'm training on the test data or there's an obvious/perfect causation being found via the correlations. My use of train_test_split tells me I'm not training on my test data, and I can guarantee the second is false because of how messy this space is (we started doing manual linear regression on this data about 15 years ago, and we still maintain Excel spreadsheets so we can do it by hand in a pinch, even though that's significantly less accurate than our Azure ML Studio models).
Let's look at the code. Here's the relevant portion of my Jupyter notebook (sorry if there's a better way to format this):
X = myData
y = myData.ValueToPredict
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
train_size = 0.75,
test_size = 0.25)
print("X_train: ", X_train.shape)
print("y_train: ", y_train.shape)
print("X_test: ", X_test.shape)
print("y_test: ", y_test.shape)
X_train: (300, 17)
y_train: (300,)
X_test: (101, 17)
y_test: (101,)
ESTIMATORS = {
"Extra Trees": ExtraTreesRegressor(criterion = "mse",
n_estimators=10,
max_features=16,
random_state=42),
"Decision Tree": DecisionTreeRegressor(criterion = "mse",
splitter = "best",
random_state=42),
"Random Forest": RandomForestRegressor(criterion = "mse",
random_state=42),
"Linear regression": LinearRegression(),
"Ridge": RidgeCV(),
}
y_test_predict = dict()
y_test_rmse = dict()
for name, estimator in ESTIMATORS.items():
estimator.fit(X_train, y_train)
y_test_predict[name] = estimator.predict(X_test)
y_test_rmse[name] = np.sqrt(np.mean((y_test - y_test_predict[name]) ** 2)) # I think this might be wrong but isn't the source of my problem
for name, error in y_test_rmse.items():
print(name + " RMSE: " + str(error))
Extra Trees RMSE: 0.3843540838630157
Decision Tree RMSE: 0.32838969545222946
Random Forest RMSE: 0.4304701784728594
Linear regression RMSE: 7.971345895791494e-15
Ridge RMSE: 0.0001390197344951183
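Just to rule out a bug in that hand-rolled RMSE line, here's a quick cross-check against scikit-learn's own metric (a minimal sketch; it assumes the y_test_predict dict populated above and numpy imported as np):

from sklearn.metrics import mean_squared_error

# RMSE computed from sklearn's MSE; should print the same numbers as above
for name, y_pred in y_test_predict.items():
    print(name, "RMSE (sklearn):", np.sqrt(mean_squared_error(y_test, y_pred)))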
y_test_score = dict()
for name, estimator in ESTIMATORS.items():
estimator.fit(X_train, y_train)
y_test_predict[name] = estimator.predict(X_test)
y_test_score[name] = estimator.score(X_test, y_test)
for name, error in y_test_score.items():
print(name + " Score: " + str(error))
Extra Trees Score: 0.9990166492769291
Decision Tree Score: 0.999282165241745
Random Forest Score: 0.998766521504593
Linear regression Score: 1.0
Ridge Score: 0.9999999998713534
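(For what it's worth, score() on a regressor is just R², so the same numbers fall out of r2_score directly; a minimal sketch assuming sklearn.metrics is available:)

from sklearn.metrics import r2_score

# score(X_test, y_test) on a regressor returns R^2 of its predictions,
# so this should reproduce the values printed above
for name, y_pred in y_test_predict.items():
    print(name, "R^2:", r2_score(y_test, y_pred))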
I thought maybe I was doing the error metrics wrong, so I also looked at simple scores (which is why I included both). However, both show that these predictions are too good to be true. Keep in mind that the volume of inputs is small (~400 records in total). And the data this runs over is essentially predicting consumption of a commodity based on weather patterns, which is a messy space to begin with, so lots of error should be present.
What am I doing wrong here?
(Also, if I can ask this question in a better way or provide more useful information, I'd greatly appreciate it!)
Here is a heatmap of the data. I indicated the value we're predicting.
I also plotted a couple of those more important inputs vs the value we're predicting (color-coded by yet another dimension):
Here's column #2, as asked about in comments
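(For reference, the heatmap is essentially the pairwise correlation matrix of the frame; a minimal sketch of how such a heatmap can be produced, assuming seaborn and matplotlib are installed:)

import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations across all columns, including ValueToPredict;
# the target's row/column shows how strongly each input tracks it.
plt.figure(figsize=(10, 8))
sns.heatmap(myData.corr(), cmap="coolwarm", center=0)
plt.show()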
As pointed out by @jwil, I wasn't pulling my ValueToPredict column out of my X variable. The solution was a one-liner added to remove that column:
X = myData
y = myData.ValueToPredict
X = X.drop("ValueToPredict", axis=1) # <--- ONE-LINE FIX!
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
train_size = 0.75,
test_size = 0.25)
With this in place, my error & scores are much more where I expect:
Extra Trees RMSE: 1.6170428819849574
Decision Tree RMSE: 1.990459810552763
Random Forest RMSE: 1.699801032532343
Linear regression RMSE: 2.5265108241534397
Ridge RMSE: 2.528721533965162
Extra Trees Score: 0.9825944193611161
Decision Tree Score: 0.9736274412836977
Random Forest Score: 0.9807672396970707
Linear regression Score: 0.9575098985510281
Ridge Score: 0.9574355079097321
Upvotes: 2
Views: 2238
Reputation: 553
You're right; I strongly suspect that you have one or more features in your X data that are nearly perfectly correlated with the Y data. Usually this is bad, because those variables don't explain Y but are either explained by Y or jointly determined with Y. To troubleshoot this, consider performing a linear regression of Y on X and then using simple p-values or AIC/BIC to determine which X variables are the least relevant. Drop those and repeat the process until your R^2 begins to drop seriously (though it will drop a little each time). The remaining variables will be the most relevant for prediction, and hopefully you'll be able to identify from that subset which variables are so tightly correlated with Y.
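A rough sketch of that backward-elimination loop, using p-values and assuming statsmodels is installed and X/y are the pandas objects from the question, might look like this:

import statsmodels.api as sm

def backward_eliminate(X, y, p_threshold=0.05):
    # Drop the least significant feature one at a time until every
    # remaining feature's p-value is below the threshold.
    features = list(X.columns)
    while len(features) > 1:
        model = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= p_threshold:
            break
        print("Dropping", worst, "p =", round(pvals[worst], 3),
              "R^2 =", round(model.rsquared, 3))
        features.remove(worst)
    return features

kept = backward_eliminate(X, y)
print("Remaining features:", kept)

Watching how R^2 changes as features drop out should make any suspiciously perfect predictor stand out quickly.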
Upvotes: 1