Reputation: 3607
I'm having a weird issue with a new installation of xgboost. Under normal circumstances it works fine, but when I use the model in the following function it gives the error in the title.
The dataset I'm using is borrowed from Kaggle and can be seen here: https://www.kaggle.com/kemical/kickstarter-projects
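For context, I load it along these lines (the CSV filename here is just what the Kaggle download is typically called, so yours may differ):
import pandas as pd

# Kickstarter projects data from Kaggle; adjust the filename to match your download
df = pd.read_csv('ks-projects-201801.csv')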
The function I use to fit my model is the following:
from sklearn.model_selection import train_test_split, cross_val_score
import pandas as pd

def get_val_scores(model, X, y, return_test_score=False, return_importances=False, random_state=42, randomize=True, cv=5, test_size=0.2, val_size=0.2, use_kfold=False, return_folds=False, stratify=True):
    print("Splitting data into training and test sets")
    if randomize:
        if stratify:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, stratify=y, shuffle=True, random_state=random_state)
        else:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=True, random_state=random_state)
    else:
        if stratify:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, stratify=y, shuffle=False)
        else:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=False)
    print(f"Shape of training data, X: {X_train.shape}, y: {y_train.shape}. Test, X: {X_test.shape}, y: {y_test.shape}")
    if use_kfold:
        val_scores = cross_val_score(model, X=X_train, y=y_train, cv=cv)
    else:
        print("Further splitting training data into validation sets")
        if randomize:
            if stratify:
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, stratify=y_train, shuffle=True)
            else:
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, shuffle=True)
        else:
            if stratify:
                print("Warning! You opted to both stratify your training data and to not randomize it. These settings are incompatible with scikit-learn. Stratifying the data, but shuffle is being set to True")
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, stratify=y_train, shuffle=True)
            else:
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, shuffle=False)
        print(f"Shape of training data, X: {X_train_.shape}, y: {y_train_.shape}. Val, X: {X_val.shape}, y: {y_val.shape}")
        print("Getting ready to fit model.")
        model.fit(X_train_, y_train_)
        val_score = model.score(X_val, y_val)
    if return_importances:
        if hasattr(model, 'steps'):
            try:
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model[-2].feature_importances_
                }).sort_values(by='Importance', ascending=False)
            except:
                model.fit(X_train, y_train)
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model[-2].feature_importances_
                }).sort_values(by='Importance', ascending=False)
        else:
            try:
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model.feature_importances_
                }).sort_values(by='Importance', ascending=False)
            except:
                model.fit(X_train, y_train)
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model.feature_importances_
                }).sort_values(by='Importance', ascending=False)
    mod_scores = {}
    try:
        mod_scores['validation_score'] = val_scores.mean()
        if return_folds:
            mod_scores['fold_scores'] = val_scores
    except:
        mod_scores['validation_score'] = val_score
    if return_test_score:
        mod_scores['test_score'] = model.score(X_test, y_test)
    if return_importances:
        return mod_scores, feats
    else:
        return mod_scores
The weird part is that if I create a pipeline in sklearn, it works on the dataset outside of the function but not inside it. For example:
from sklearn.pipeline import make_pipeline
from category_encoders import OrdinalEncoder
from xgboost import XGBClassifier
pipe = make_pipeline(OrdinalEncoder(), XGBClassifier())
X = df.drop('state', axis=1)
y = df['state']
In this case, pipe.fit(X, y) works just fine, but get_val_scores(pipe, X, y) fails with the error message in the title. What's weirder is that get_val_scores(pipe, X, y) seems to work with other datasets, like Titanic. The error occurs as the model is fitting on X_train and y_train.
In this case the loss function is binary:logistic, and the state column has the values successful and failed.
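Boiled down, with the default arguments the failing call inside the function is roughly equivalent to this (same split settings as above):
from sklearn.model_selection import train_test_split

# outer train/test split, then inner train/validation split, mirroring get_val_scores' defaults
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, shuffle=True, random_state=42)
X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=0.2, stratify=y_train, shuffle=True)
pipe.fit(X_train_, y_train_)  # this inner fit is where the error from the title is raised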
Upvotes: 5
Views: 4096
Reputation: 84
This problem is more common when you are running in a shell or virtual environment, or when there is a conflict between the various packages in your directories. There might be a conflict between XGBoost and some other libraries.
I have experienced this with a running system that suddenly stopped working.
If everything was fine with your code at some point, then this is likely the case. You will have to reinstall all the major packages, XGBoost included. @Mohamad Osman above has provided good steps to follow on this.
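As a rough sketch (the package list here is only illustrative), recreating a clean virtual environment and reinstalling looks something like:
python -m venv fresh-env
source fresh-env/bin/activate  # on Windows: fresh-env\Scripts\activate
pip install --upgrade pip
pip install numpy pandas scikit-learn category_encoders xgboost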
Upvotes: 0
Reputation: 113
I've also had the same error, but in my case it was resolved by converting the bool columns to numeric.
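For example, a minimal sketch of that conversion, assuming X is a pandas DataFrame:
# cast every boolean column to integers (True -> 1, False -> 0) before fitting
bool_cols = X.select_dtypes(include='bool').columns
X[bool_cols] = X[bool_cols].astype(int)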
Upvotes: 0
Reputation: 56
The xgboost library is currently being updated to fix this bug, so the current solution is to downgrade the library to an older version. I solved this problem by downgrading to xgboost v0.90.
Check your xgboost version from the command line:
python
import xgboost
print(xgboost.__version__)
exit()
If the version is not 0.90, uninstall the current version:
pip uninstall xgboost
Then install xgboost version 0.90:
pip install xgboost==0.90
and run your code again!
Upvotes: 3
Reputation: 66
This bug will be fixed in XGBoost 1.4.2
See: https://github.com/dmlc/xgboost/pull/6927
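Once 1.4.2 is released, upgrading should pick up the fix, for example:
pip install --upgrade "xgboost>=1.4.2"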
Upvotes: 2
Reputation: 7313
I am using Python 3.8.6 on macOS Big Sur and just encountered this error with xgboost==1.4.0 and 1.4.1. When I downgraded to 1.3.3, the issue went away. Try upgrading or downgrading depending on your current version.
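For example, pinning the version that worked for me:
pip uninstall xgboost
pip install xgboost==1.3.3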
Upvotes: 1