I am using Python v3.7 and xgboost v0.81. I have continuous data (y) at the US state level, by week, from 2015 to 2019. I'm trying to regress y on the following features: year, month, week, and region (encoded). I've set the training set as August 2018 and earlier, and the test set as September 2018 and onward. When I train the model this way, two weird things happen:

- the predicted values are all identical
- the feature importances are all NaN

Fixing any one of the features to a single value (e.g. year == 2017 or region == 28) makes the model train appropriately, and the two weird issues above are gone.
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Features and continuous target
X = df[['year', 'month', 'week', 'region_encoded']]
display(X)
y = df.target
display(y)

# Random 90/10 train/test split
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.1)

model = XGBRegressor(n_jobs=-1, n_estimators=1000).fit(X_train, y_train)
display(model.predict(X_test)[:20])
display(model.feature_importances_)
   year  month  week  region_encoded
0  2015     10    40               0
1  2015     10    40               1
2  2015     10    40               2
3  2015     10    40               3
4  2015     10    40               4
0    272.0
1     10.0
2    290.0
3     46.0
4    558.0
Name: target, dtype: float64
The first 20 predictions are all the same value:

array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5], dtype=float32)

and the feature importances are all NaN:

array([nan, nan, nan, nan], dtype=float32)
If the target variable contains even a single NaN, that is enough to break many machine learning algorithms. This is usually because, when an unhandled NaN is present in the target during the update step of many ML algorithms (for example, when computing derivatives), the NaN propagates. I cannot say exactly which step in XGBoost does this, though.
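The propagation itself is easy to see in isolation. Here is a minimal sketch; the squared-error gradient below is just an illustrative stand-in, not XGBoost's actual update:

import numpy as np

y_true = np.array([1.0, np.nan, 3.0])
y_pred = np.array([1.5, 2.0, 2.5])

# Gradient of squared error w.r.t. the predictions: a NaN in y_true
# contaminates the corresponding gradient entry...
grad = 2 * (y_pred - y_true)
print(grad)         # [ 1. nan -1.]

# ...and any reduction over the gradient (e.g. a mean) becomes NaN too
print(grad.mean())  # nan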
For example, consider the closed-form least-squares solution for linear regression:
import numpy as np
import numpy.linalg as la
from scipy import stats

# Target vector containing a single NaN
y = np.array([0, 1, 2, 3, np.nan, 5, 6, 7, 8, 9])
# Random design matrix: 10 observations, 3 features
x = stats.norm().rvs((len(y), 3))

# Closed-form least-squares estimate: the NaN propagates through the
# matrix products, so every coefficient comes out as NaN
m_hat = la.inv(x.T @ x) @ x.T @ y
print(m_hat)
# [nan nan nan]
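Masking out the missing observation restores finite estimates (a sketch continuing from the arrays above):

# Keep only the rows where the target is actually observed
mask = ~np.isnan(y)
m_hat = la.inv(x[mask].T @ x[mask]) @ x[mask].T @ y[mask]
print(m_hat)  # three finite coefficients now

So the first thing to check in your setup is whether df.target contains any NaN, e.g. with df.target.isna().sum(), and to drop or impute those rows before fitting.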