Reputation: 1496
I am working on a Kaggle competition (data here), and I am having trouble using scikit-learn's GradientBoostingRegressor. The competition uses the root mean squared logarithmic error (RMSLE) to evaluate predictions.
For the sake of an MWE, here is the code I used to clean the train.csv
at the link above:
import pandas as pd

# Parse the pickup timestamp and expand it into separate numeric features
train = pd.read_csv("train.csv", index_col=0)
train.pickup_datetime = pd.to_datetime(train.pickup_datetime)
train["pickup_month"] = train.pickup_datetime.apply(lambda x: x.month)
train["pickup_day"] = train.pickup_datetime.apply(lambda x: x.day)
train["pickup_hour"] = train.pickup_datetime.apply(lambda x: x.hour)
train["pickup_minute"] = train.pickup_datetime.apply(lambda x: x.minute)
train["pickup_weekday"] = train.pickup_datetime.apply(lambda x: x.weekday())
train = train.drop(["pickup_datetime", "dropoff_datetime"], axis=1)

# Encode the single categorical column as 0/1
train["store_and_fwd_flag"] = pd.get_dummies(train.store_and_fwd_flag, drop_first=True)

X_train = train.drop("trip_duration", axis=1)
y_train = train.trip_duration
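(As an aside, the same datetime features can be extracted without apply(); the vectorized .dt accessor is equivalent and usually much faster on a frame this size. A sketch of the alternative:)

# Alternative to the apply() lines above (run before pickup_datetime is dropped)
train["pickup_month"] = train.pickup_datetime.dt.month
train["pickup_day"] = train.pickup_datetime.dt.day
train["pickup_hour"] = train.pickup_datetime.dt.hour
train["pickup_minute"] = train.pickup_datetime.dt.minute
train["pickup_weekday"] = train.pickup_datetime.dt.weekday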
To illustrate something that works, if I use a random forest, then the RMSLE is computed just fine:
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
def rmsle(predicted, real):
    sum = 0.0
    for x in range(len(predicted)):
        p = np.log(predicted[x] + 1)
        r = np.log(real[x] + 1)
        sum = sum + (p - r)**2
    return (sum / len(predicted))**0.5
rmsle_score = make_scorer(rmsle, greater_is_better=False)
rf = RandomForestRegressor(random_state=1839, n_jobs=-1, verbose=2)
rf_scores = cross_val_score(rf, X_train, y_train, cv=3, scoring=rmsle_score)
print(np.mean(rf_scores))
This runs just fine. However, the gradient boosting regressor throws RuntimeWarning: invalid value encountered in log, and I get a nan from the print statement. Looking at the array of three RMSLE scores, they are all nan.
gb = GradientBoostingRegressor(verbose=2)
gbr_scores = cross_val_score(gb, X_train, y_train, cv=3, scoring=rmsle_score)
print(np.mean(gbr_scores))
I assume this is because I'm getting a negative value somewhere I shouldn't be. Kaggle reported encountering zero or negative values as well when I uploaded my predictions there to see if it was something about my code. Is there a reason why gradient boosting cannot be used for this problem? If I use mean_squared_error as the scorer (mse_score = make_scorer(mean_squared_error, greater_is_better=False)), it returns a score just fine.
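One way I could test this assumption is to pull out-of-fold predictions and count the negative ones (a sketch; cross_val_predict is standard scikit-learn, the check itself is just my guess at the diagnosis):

from sklearn.model_selection import cross_val_predict

# out-of-fold predictions from the same 3-fold split
gb_preds = cross_val_predict(GradientBoostingRegressor(), X_train, y_train, cv=3)
print((gb_preds < 0).sum())  # count of negative predictions that would break np.log(pred + 1)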
I'm sure I'm missing something simple about gradient boosting; why is this scoring method not working for the gradient boosting regressor?
Upvotes: 2
Views: 12618
Reputation: 2426
I would suggest you vectorize this:
def rmsle(y, y0):
    return np.sqrt(np.mean(np.square(np.log1p(y) - np.log1p(y0))))
Benchmarks can be found here: https://www.kaggle.com/jpopham91/rmlse-vectorized
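A quick check with made-up numbers (the arrays here are just illustrative) shows it agrees with the loop-based version:

import numpy as np

real = np.array([120.0, 300.0, 45.0])
pred = np.array([100.0, 350.0, 40.0])
print(rmsle(real, pred))  # ≈ 0.152, same as the loop implementation

It can be passed to make_scorer exactly as before.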
Upvotes: 8
Reputation: 8811
First, the signature make_scorer expects for your function is of the form:

def metric(real, predictions)

not

def metric(predictions, real)

So in your code, the parameter named predicted is actually receiving the real values, and real is receiving your regressor's predictions.
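You can see the order for yourself with a small probe (a sketch; probe and the toy data are made up, but make_scorer and its calling convention are standard scikit-learn):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer

def probe(first_arg, second_arg):
    print("first:", np.asarray(first_arg)[:3])    # this is y_true
    print("second:", np.asarray(second_arg)[:3])  # this is the model's predictions
    return 0.0

X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel()
scorer = make_scorer(probe)
scorer(LinearRegression().fit(X, y), X, y)  # prints the true y first, predictions second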
Just change the function as follows and it should work correctly:
def rmsle(real, predicted):
    sum = 0.0
    for x in range(len(predicted)):
        if predicted[x] < 0 or real[x] < 0:  # check for negative values
            continue
        p = np.log(predicted[x] + 1)
        r = np.log(real[x] + 1)
        sum = sum + (p - r)**2
    return (sum / len(predicted))**0.5
Secondly, your regressor is giving a wrong (negative) value as its prediction for row no. 399937 in the first cross-validation split. Hope this helps! All the best for your competition.
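An alternative to skipping those rows (not part of the fix above, just a common workaround) is to clip predictions at zero before taking the log, so every row still contributes to the score:

def rmsle_clipped(real, predicted):
    # negatives become 0, so log1p is always defined
    p = np.log1p(np.clip(predicted, 0, None))
    r = np.log1p(np.clip(real, 0, None))
    return np.sqrt(np.mean((p - r) ** 2))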
Upvotes: 4