Reputation: 1496
I am working on a Kaggle competition (data here), and I am having trouble using scikit-learn's GradientBoostingRegressor. The competition uses the root mean squared logarithmic error (RMSLE) to evaluate predictions.
For the sake of an MWE, here is the code I used to clean the train.csv
at the link above:
import pandas as pd

# Parse the pickup timestamp and expand it into separate numeric features
train = pd.read_csv("train.csv", index_col=0)
train.pickup_datetime = pd.to_datetime(train.pickup_datetime)
train["pickup_month"] = train.pickup_datetime.apply(lambda x: x.month)
train["pickup_day"] = train.pickup_datetime.apply(lambda x: x.day)
train["pickup_hour"] = train.pickup_datetime.apply(lambda x: x.hour)
train["pickup_minute"] = train.pickup_datetime.apply(lambda x: x.minute)
train["pickup_weekday"] = train.pickup_datetime.apply(lambda x: x.weekday())
train = train.drop(["pickup_datetime", "dropoff_datetime"], axis=1)

# Encode the single categorical column as 0/1
train["store_and_fwd_flag"] = pd.get_dummies(train.store_and_fwd_flag, drop_first=True)

X_train = train.drop("trip_duration", axis=1)
y_train = train.trip_duration
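(As an aside, the same datetime features can be extracted without apply(); the vectorized .dt accessor is equivalent and usually much faster on a frame this size. A sketch of the alternative:)

# Alternative to the apply() lines above (run before pickup_datetime is dropped)
train["pickup_month"] = train.pickup_datetime.dt.month
train["pickup_day"] = train.pickup_datetime.dt.day
train["pickup_hour"] = train.pickup_datetime.dt.hour
train["pickup_minute"] = train.pickup_datetime.dt.minute
train["pickup_weekday"] = train.pickup_datetime.dt.weekday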
To illustrate something that works, if I use a random forest, then the RMSLE is computed just fine:
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
def rmsle(predicted, real):
    sum = 0.0
    for x in range(len(predicted)):
        p = np.log(predicted[x] + 1)
        r = np.log(real[x] + 1)
        sum = sum + (p - r)**2
    return (sum / len(predicted))**0.5
rmsle_score = make_scorer(rmsle, greater_is_better=False)
rf = RandomForestRegressor(random_state=1839, n_jobs=-1, verbose=2)
rf_scores = cross_val_score(rf, X_train, y_train, cv=3, scoring=rmsle_score)
print(np.mean(rf_scores))
This runs just fine. However, the gradient boosting regressor throws RuntimeWarning: invalid value encountered in log, and I get a nan from the print statement. Looking at the array of three RMSLE scores, they are all nan.
gb = GradientBoostingRegressor(verbose=2)
gbr_scores = cross_val_score(gb, X_train, y_train, cv=3, scoring=rmsle_score)
print(np.mean(gbr_scores))
I assume this is because I'm getting a negative value somewhere I shouldn't be. Kaggle reported encountering zero or negative values as well when I uploaded my predictions there to see if it was something about my code. Is there a reason why gradient boosting cannot be used for this problem? If I use mean_squared_error as the scorer (mse_score = make_scorer(mean_squared_error, greater_is_better=False)), it returns a score just fine.
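One way I could test this assumption is to pull out-of-fold predictions and count the negative ones (a sketch; cross_val_predict is standard scikit-learn, the check itself is just my guess at the diagnosis):

from sklearn.model_selection import cross_val_predict

# out-of-fold predictions from the same 3-fold split
gb_preds = cross_val_predict(GradientBoostingRegressor(), X_train, y_train, cv=3)
print((gb_preds < 0).sum())  # count of negative predictions that would break np.log(pred + 1)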
I'm sure I'm missing something simple about gradient boosting; why is this scoring method not working for the gradient boosting regressor?
Upvotes: 2
Views: 12618
Reputation: 2426
I would suggest you vectorize this:
def rmsle(y, y0):
    return np.sqrt(np.mean(np.square(np.log1p(y) - np.log1p(y0))))
Benchmarks can be found here: https://www.kaggle.com/jpopham91/rmlse-vectorized
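A quick check with made-up numbers (the arrays here are just illustrative) shows it agrees with the loop-based version:

import numpy as np

real = np.array([120.0, 300.0, 45.0])
pred = np.array([100.0, 350.0, 40.0])
print(rmsle(real, pred))  # ≈ 0.152, same as the loop implementation

It can be passed to make_scorer exactly as before.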
Upvotes: 8
Reputation: 8811
First, the signature make_scorer expects for your function is of the form:

def metric(real, predictions)

not

def metric(predictions, real)

So in your code, the parameter named predicted is actually receiving the real values, and real is receiving your regressor's predictions.
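You can see the order for yourself with a small probe (a sketch; probe and the toy data are made up, but make_scorer and its calling convention are standard scikit-learn):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer

def probe(first_arg, second_arg):
    print("first:", np.asarray(first_arg)[:3])    # this is y_true
    print("second:", np.asarray(second_arg)[:3])  # this is the model's predictions
    return 0.0

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel()
scorer = make_scorer(probe)
scorer(LinearRegression().fit(X, y), X, y)  # prints the true y first, predictions second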
Just change the function as follows and it should work correctly:
def rmsle(real, predicted):
    sum = 0.0
    for x in range(len(predicted)):
        if predicted[x] < 0 or real[x] < 0:  # check for negative values
            continue
        p = np.log(predicted[x] + 1)
        r = np.log(real[x] + 1)
        sum = sum + (p - r)**2
    return (sum / len(predicted))**0.5
Secondly, your regressor is giving a wrong (negative) value as its prediction for row no. 399937 in the first cross-validation split. Hope this helps! All the best for your competition.
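An alternative to skipping those rows (not part of the fix above, just a common workaround) is to clip predictions at zero before taking the log, so every row still contributes to the score:

def rmsle_clipped(real, predicted):
    # negatives become 0, so log1p is always defined
    p = np.log1p(np.clip(predicted, 0, None))
    r = np.log1p(np.clip(real, 0, None))
    return np.sqrt(np.mean((p - r) ** 2))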
Upvotes: 4