Jonathan Bechtel

Reputation: 3607

XGBoostError: Check failed: typestr.size() == 3 (2 vs. 3) : `typestr' should be of format <endian><type><size of type in bytes>

I'm having a weird issue with a new installation of xgboost. Under normal circumstances it works fine. However, when I use the model in the following function it gives the error in the title.

The dataset I'm using is borrowed from kaggle, and can be seen here: https://www.kaggle.com/kemical/kickstarter-projects

The function I use to fit my model is the following:

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score

def get_val_scores(model, X, y, return_test_score=False, return_importances=False, random_state=42, randomize=True, cv=5, test_size=0.2, val_size=0.2, use_kfold=False, return_folds=False, stratify=True):
    print("Splitting data into training and test sets")
    if randomize:
        if stratify:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, stratify=y, shuffle=True, random_state=random_state)
        else:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=True, random_state=random_state)
    else:
        if stratify:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, stratify=y, shuffle=False)
        else:
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, shuffle=False)
    print(f"Shape of training data, X: {X_train.shape}, y: {y_train.shape}.  Test, X: {X_test.shape}, y: {y_test.shape}")
    if use_kfold:
        val_scores = cross_val_score(model, X=X_train, y=y_train, cv=cv)
    else:
        print("Further splitting training data into validation sets")
        if randomize:
            if stratify:
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, stratify=y_train, shuffle=True)
            else:
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, shuffle=True)
        else:
            if stratify:
                print("Warning! You opted to both stratify your training data and to not randomize it.  These settings are incompatible with scikit-learn.  Stratifying the data, but shuffle is being set to True")
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, stratify=y_train,  shuffle=True)
            else:
                X_train_, X_val, y_train_, y_val = train_test_split(X_train, y_train, test_size=val_size, shuffle=False)
        print(f"Shape of training data, X: {X_train_.shape}, y: {y_train_.shape}.  Val, X: {X_val.shape}, y: {y_val.shape}")
        print("Getting ready to fit model.")
        model.fit(X_train_, y_train_)
        val_score = model.score(X_val, y_val)
        
    if return_importances:
        if hasattr(model, 'steps'):
            try:
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model[-2].feature_importances_
                }).sort_values(by='Importance', ascending=False)
            except:
                model.fit(X_train, y_train)
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model[-2].feature_importances_
                }).sort_values(by='Importance', ascending=False)
        else:
            try:
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model.feature_importances_
                }).sort_values(by='Importance', ascending=False)
            except:
                model.fit(X_train, y_train)
                feats = pd.DataFrame({
                    'Columns': X.columns,
                    'Importance': model.feature_importances_
                }).sort_values(by='Importance', ascending=False)
            
    mod_scores = {}
    try:
        mod_scores['validation_score'] = val_scores.mean()
        if return_folds:
            mod_scores['fold_scores'] = val_scores
    except:
        mod_scores['validation_score'] = val_score
        
    if return_test_score:
        mod_scores['test_score'] =  model.score(X_test, y_test)
            
    if return_importances:
        return mod_scores, feats
    else:
        return mod_scores

The weird part is that if I create a pipeline in sklearn, it works on this dataset outside of the function, but not within it. For example:

from sklearn.pipeline import make_pipeline
from category_encoders import OrdinalEncoder
from xgboost import XGBClassifier

pipe = make_pipeline(OrdinalEncoder(), XGBClassifier())

X = df.drop('state', axis=1)
y = df['state']

In this case, pipe.fit(X, y) works just fine. But get_val_scores(pipe, X, y) fails with the error message in the title. What's weirder is that get_val_scores(pipe, X, y) seems to work with other datasets, like Titanic. The error occurs as the model is fitting on X_train and y_train.

In this case the loss function is binary:logistic, and the state column has the values successful and failed.
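Since the target holds string labels, one thing worth ruling out is the label encoding: encoding `state` to 0/1 explicitly before fitting removes any reliance on the estimator encoding strings internally. A minimal sketch with hypothetical data standing in for the real column:

```python
import pandas as pd

# Hypothetical stand-in for the `state` column from the Kickstarter data
y = pd.Series(['successful', 'failed', 'failed', 'successful'])

# Map the two string labels to 0/1 explicitly, rather than relying on
# the classifier to encode them internally
y_encoded = y.map({'failed': 0, 'successful': 1})
print(y_encoded.tolist())  # [1, 0, 0, 1]
```

The encoded series can then be passed to `get_val_scores` in place of the raw string column.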

Upvotes: 5

Views: 4096

Answers (5)

Inuwa Mobarak

Reputation: 84

This problem is most common when you are running in a shell or virtual environment, or when there is a conflict between packages in your directories. There may be a conflict between XGBoost and some other library.

I have experienced this with a running system that suddenly stopped working.

If everything was fine with your code at some point, then this is likely the case. You will have to reinstall the major packages, XGBoost included. @Mohamad Osman's answer provides good steps to follow for this.

Upvotes: 0

heschmat

Reputation: 113

I've also had the same error; in my case, it was resolved by converting the bool columns to numeric.
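A minimal sketch of that conversion, using a toy frame in place of the real features:

```python
import pandas as pd

# Toy frame with one bool column, standing in for the real dataset
df = pd.DataFrame({'flag': [True, False, True],
                   'amount': [1.0, 2.5, 0.3]})

# Cast every bool column to int so downstream libraries see a numeric dtype
bool_cols = df.select_dtypes(include='bool').columns
df[bool_cols] = df[bool_cols].astype(int)
print(df['flag'].tolist())  # [1, 0, 1]
```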

Upvotes: 0

Mohamad Osman

Reputation: 56

The xgboost library is currently being updated to fix this bug, so the current workaround is to downgrade to an older version. I solved this problem by downgrading to xgboost v0.90.

Try to check your xgboost version from the command line:

    python
    >>> import xgboost
    >>> print(xgboost.__version__)
    >>> exit()

If the version is not 0.90, uninstall the current version:

    pip uninstall xgboost

Then install xgboost version 0.90:

    pip install xgboost==0.90

Run your code again!

Upvotes: 3

Kyle

Reputation: 66

This bug will be fixed in XGBoost 1.4.2

See: https://github.com/dmlc/xgboost/pull/6927

Upvotes: 2

sedeh

Reputation: 7313

I am using python 3.8.6 on macOS Big Sur and just encountered this error with xgboost==1.4.0 and 1.4.1. When I downgraded to 1.3.3 the issue went away. Try upgrading or downgrading depending on your current version.
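If you want to check programmatically whether an installed version is one of the releases reported as affected in this thread (1.4.0 and 1.4.1), a small hypothetical helper such as the `needs_pin` function below can do it; the set of affected versions comes from the answers here, not from an official list:

```python
# Versions reported as affected in this thread (not an official list)
AFFECTED = {'1.4.0', '1.4.1'}

def needs_pin(version: str) -> bool:
    """Return True if this xgboost version was reported as affected."""
    return version in AFFECTED

print(needs_pin('1.4.1'))  # True
print(needs_pin('1.3.3'))  # False
```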

Upvotes: 1
