Toutsos

Reputation: 369

Negative cross_val_score with DecisionTreeRegressor model

I am evaluating a DecisionTreeRegressor prediction model with the cross_val_score method. The problem is that the score appears to be negative, and I really don't understand why.

This is my code:

import numpy as np
from scipy.stats import sem
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# df holds the observations; columns 'system' through 'gwno' are the features
all_depths = []
all_mean_scores = []
for max_depth in range(1, 11):
    all_depths.append(max_depth)
    simple_tree = DecisionTreeRegressor(max_depth=max_depth)
    cv = KFold(n_splits=2, shuffle=True, random_state=13)
    scores = cross_val_score(simple_tree, df.loc[:, 'system':'gwno'], df['gdp_growth'], cv=cv)
    mean_score = np.mean(scores)
    all_mean_scores.append(mean_score)
    print("max_depth = ", max_depth, scores, mean_score, sem(scores))

The result:

max_depth =  1 [-0.45596988 -0.10215719] -0.2790635315340 0.176906344162
max_depth =  2 [-0.5532268 -0.0186984] -0.285962600541 0.267264196259
max_depth =  3 [-0.50359311  0.31992411] -0.0918345038141 0.411758610421
max_depth =  4 [-0.57305355  0.21154193] -0.180755811466 0.392297741456
max_depth =  5 [-0.58994928  0.21180425] -0.189072515181 0.400876761509
max_depth =  6 [-0.71730634  0.22139877] -0.247953784441 0.469352551213
max_depth =  7 [-0.60118621  0.22139877] -0.189893720551 0.411292487323
max_depth =  8 [-0.69635044  0.13976584] -0.278292298411 0.418058142228
max_depth =  9 [-0.78917478  0.30970763] -0.239733577455 0.549441204178
max_depth =  10 [-0.76098227  0.34512503] -0.207928623044 0.553053649792

My questions are as follows:

1) The score returns MSE, right? If so, how come it is negative?

2) I have a small sample of ~40 observations and ~70 variables. Might this be the problem?

Thanks in advance.

Upvotes: 4

Views: 7064

Answers (2)

juanpa.arrivillaga

Reputation: 95872

TL;DR:

1) No, not unless you request that explicitly, or MSE happens to be the default .score method of the estimator. Since you did not pass a scoring argument, it defaulted to DecisionTreeRegressor.score, which returns the coefficient of determination, i.e. R^2, and R^2 can be negative.

2) Yes, that is a problem. And it explains why you are getting a negative coefficient of determination.
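To see concretely how R^2 can drop below zero, here is a minimal sketch using `sklearn.metrics.r2_score` with made-up numbers: R^2 goes negative as soon as the predictions are worse than always predicting the mean of the targets.

```python
from sklearn.metrics import r2_score

y_true = [1, 2, 3]

# Predicting the mean of y_true (2) for every sample gives R^2 = 0.0
print(r2_score(y_true, [2, 2, 2]))  # 0.0

# Predictions that are worse than the mean push R^2 below zero
print(r2_score(y_true, [3, 2, 1]))  # -3.0
```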

The details:

You've used the function like this:

scores = cross_val_score(simple_tree, df.loc[:,'system':'gwno'], df['gdp_growth'], cv=cv)

So you didn't explicitly pass a "scoring" parameter. Let's look at the docs:

scoring : string, callable or None, optional, default: None

A string (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y).

So it doesn't explicitly state this, but it likely means that it uses the default .score method of your estimator.

To confirm that hypothesis, let's dig through the source code. We see that the scorer that is ultimately used is the following:

scorer = check_scoring(estimator, scoring=scoring)

So, let's see the source for check_scoring

has_scoring = scoring is not None
if not hasattr(estimator, 'fit'):
    raise TypeError("estimator should be an estimator implementing "
                    "'fit' method, %r was passed" % estimator)
if isinstance(scoring, six.string_types):
    return get_scorer(scoring)
elif has_scoring:
    # Heuristic to ensure user has not passed a metric
    module = getattr(scoring, '__module__', None)
    if hasattr(module, 'startswith') and \
       module.startswith('sklearn.metrics.') and \
       not module.startswith('sklearn.metrics.scorer') and \
       not module.startswith('sklearn.metrics.tests.'):
        raise ValueError('scoring value %r looks like it is a metric '
                         'function rather than a scorer. A scorer should '
                         'require an estimator as its first parameter. '
                         'Please use `make_scorer` to convert a metric '
                         'to a scorer.' % scoring)
    return get_scorer(scoring)
elif hasattr(estimator, 'score'):
    return _passthrough_scorer
elif allow_none:
    return None
else:
    raise TypeError(
        "If no scoring is specified, the estimator passed should "
        "have a 'score' method. The estimator %r does not." % estimator)

So note that scoring=None has been carried through, so:

has_scoring = scoring is not None

Implies that has_scoring == False. Also, the estimator has a .score attribute, so we go through this branch:

elif hasattr(estimator, 'score'):
    return _passthrough_scorer

Which is simply:

def _passthrough_scorer(estimator, *args, **kwargs):
    """Function that wraps estimator.score"""
    return estimator.score(*args, **kwargs)
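You can sanity-check that equivalence yourself. This is a small sketch on synthetic data (all names here are made up): the default .score of a regressor is R^2 of its predictions, which is exactly what _passthrough_scorer ends up returning.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_train, y_train = rng.rand(30, 2), rng.rand(30)
X_test, y_test = rng.rand(10, 2), rng.rand(10)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_train, y_train)

# The estimator's default .score is R^2 of its predictions
assert np.isclose(tree.score(X_test, y_test),
                  r2_score(y_test, tree.predict(X_test)))
```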

So finally, we now know that the scorer is whatever the default score is for your estimator. Let's check the docs for the estimator, which clearly states:

Returns the coefficient of determination R^2 of the prediction.

The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum(). Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0.
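Plugging small made-up numbers into that formula shows each piece (a sketch of the definition, not sklearn internals):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([4.0, 5.0, 8.0])

u = ((y_true - y_pred) ** 2).sum()         # residual sum of squares: 2.0
v = ((y_true - y_true.mean()) ** 2).sum()  # total sum of squares: 8.0

# 1 - u/v matches what r2_score computes (0.75 here)
assert np.isclose(1 - u / v, r2_score(y_true, y_pred))
```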

So it seems like your scores are actually coefficients of determination. A negative R^2 means your models are performing very poorly: worse than if you just predicted the expected value (i.e. the mean) for every input. Which makes sense, since as you state:

I have a small sample of ~40 observations and ~70 variables. Might this be the problem?

It is a problem. It is practically hopeless to get meaningful predictions in a 70-dimensional feature space when you only have 40 observations.
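You can reproduce this failure mode with pure noise of roughly the same shape (a synthetic stand-in for the ~40 x ~70 setup; there is no signal to find, so the held-out R^2 has no reason to be above zero):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# 40 observations, 70 features of pure noise, noise target
rng = np.random.RandomState(13)
X = rng.randn(40, 70)
y = rng.randn(40)

cv = KFold(n_splits=2, shuffle=True, random_state=13)
scores = cross_val_score(DecisionTreeRegressor(random_state=0), X, y, cv=cv)
print(scores)  # R^2 per fold; with no signal these typically come out well below 0
```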

Upvotes: 4

Sriram Sitharaman

Reputation: 857

It can happen. This has already been answered in this post!

If you use scoring='neg_mean_squared_error', the actual MSE is simply the positive version of the number you're getting.

The unified scoring API always maximizes the score, so scores which need to be minimized are negated in order for the unified scoring API to work correctly. The score that is returned is therefore negated when it is a score that should be minimized and left positive if it is a score that should be maximized.
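For example (a small sketch with synthetic data; names and values are made up), requesting `scoring='neg_mean_squared_error'` always yields non-positive numbers, and you flip the sign to recover the actual MSE:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(13)
X = rng.rand(40, 3)
y = X[:, 0] + 0.1 * rng.randn(40)

cv = KFold(n_splits=2, shuffle=True, random_state=13)
scores = cross_val_score(DecisionTreeRegressor(max_depth=2, random_state=0),
                         X, y, cv=cv, scoring='neg_mean_squared_error')

# MSE is always >= 0, so the negated scores are always <= 0
assert (scores <= 0).all()
mse = -scores  # flip the sign to recover the actual MSE
```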

Upvotes: 2
