Stanislav Jirák

Reputation: 485

CatBoostError: cat_features must be integer or string, real number values and NaN values should be converted to string

I have a dataset with 122 columns which looks like:

train.head()

SK_ID_CURR  TARGET  NAME_CONTRACT_TYPE  CODE_GENDER FLAG_OWN_CAR    FLAG_OWN_REALTY CNT_CHILDREN    AMT_INCOME_TOTAL    AMT_CREDIT  AMT_ANNUITY ... FLAG_DOCUMENT_18    FLAG_DOCUMENT_19    FLAG_DOCUMENT_20    FLAG_DOCUMENT_21    AMT_REQ_CREDIT_BUREAU_HOUR  AMT_REQ_CREDIT_BUREAU_DAY   AMT_REQ_CREDIT_BUREAU_WEEK  AMT_REQ_CREDIT_BUREAU_MON   AMT_REQ_CREDIT_BUREAU_QRT   AMT_REQ_CREDIT_BUREAU_YEAR
0   100002  1   Cash loans  M   N   Y   0   202500.0    406597.5    24700.5 ... 0   0   0   0   0   0   0   0   0   1
1   100003  0   Cash loans  F   N   N   0   270000.0    1293502.5   35698.5 ... 0   0   0   0   0   0   0   0   0   0
2   100004  0   Revolving loans M   Y   Y   0   67500.0 135000.0    6750.0  ... 0   0   0   0   0   0   0   0   0   0
3   100006  0   Cash loans  F   N   Y   0   135000.0    312682.5    29686.5 ... 0   0   0   0   255 255 255 255 65535   255
4   100007  0   Cash loans  M   N   Y   0   121500.0    

I've imputed all NaNs and want to use CatBoost as follows:

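import numpy as np
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Get variables for a model
x = train.drop(["TARGET"], axis=1)
y = train["TARGET"]

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Treat every column whose dtype is not float as categorical
cat_features = np.where(x.dtypes != float)[0]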

cat = CatBoostClassifier(one_hot_max_size=7, iterations=21, random_seed=42, use_best_model=True, eval_metric='Accuracy', loss_function='Logloss')

cat.fit(X_train, y_train, cat_features = cat_features, eval_set=(X_test, y_test))
pred = cat.predict(X_test)

pool = Pool(X_train, y_train, cat_features=cat_features)
cv_scores = cv(pool, cat.get_params(), fold_count=10, plot=True)
print('CV score: {:.5f}'.format(cv_scores['test-Accuracy-mean'].values[-1]))
print('The test accuracy is :{:.6f}'.format(accuracy_score(y_test, cat.predict(X_test))))

which raises:

CatBoostError: Invalid type for cat_feature[534,6]=118975.5 : cat_features must be integer or string, real number values and NaN values should be converted to string.

All NaNs are imputed, as mentioned (I checked), and the code specifies that cat_features are the columns that are not real numbers.

Would someone help me to solve the mystery, please?

Upvotes: 5

Views: 23383

Answers (4)

Leon

Reputation: 468

I believe your data still has missing values that were not imputed; in my experience that is a likely cause. You can call .fillna(-999, inplace=True) on all your features, after which these error messages should disappear. You may have doubts about this imputation, but do not worry: CatBoost will treat the missing values filled with -999 as a category of their own. Once you have done this and obtained a result, go back and check whether it is reasonable.
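A minimal sketch of that kind of imputation, assuming a pandas DataFrame. The helper name `fill_missing`, the "missing" placeholder string, and the toy frame are hypothetical and only illustrate the idea; numeric columns get the -999 sentinel, while other columns get a string placeholder and are cast to strings.

```python
import numpy as np
import pandas as pd

def fill_missing(df, num_fill=-999, cat_fill="missing"):
    """Return a copy with NaNs filled: numeric columns get num_fill, others cat_fill."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(num_fill)
        else:
            # Non-numeric columns are filled and cast to plain strings
            out[col] = out[col].fillna(cat_fill).astype(str)
    return out

# Hypothetical toy frame standing in for the real data
toy = pd.DataFrame({
    "AMT_ANNUITY": [24700.5, np.nan, 6750.0],
    "CODE_GENDER": ["M", None, "F"],
})
filled = fill_missing(toy)
print(filled.isna().sum().sum())  # number of NaNs left after filling
```

Filling on a copy rather than in place keeps the original frame available for comparing results before and after imputation.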

Hope this helps.

Upvotes: 0

Novan Dwi Atmaja

Reputation: 29

Please check the order of the feature names stored on your model in cat.feature_names_. The safe way is:

cat.predict(X_test[cat.feature_names_])
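The same reordering can be sketched without a fitted model. Here `trained_order` is a hypothetical stand-in for `cat.feature_names_`, and `align_columns` is an illustrative helper, not part of CatBoost; it also fails loudly when a training column is absent from the input.

```python
import pandas as pd

def align_columns(df, feature_names):
    """Reorder df to the training column order; raise if a column is missing."""
    missing = [name for name in feature_names if name not in df.columns]
    if missing:
        raise KeyError(f"columns missing from input: {missing}")
    return df[list(feature_names)]

# Hypothetical stand-in for cat.feature_names_
trained_order = ["CODE_GENDER", "AMT_CREDIT", "CNT_CHILDREN"]

# Same columns as at training time, but in a different order
X_new = pd.DataFrame({
    "AMT_CREDIT": [406597.5],
    "CNT_CHILDREN": [0],
    "CODE_GENDER": ["M"],
})
aligned = align_columns(X_new, trained_order)
print(list(aligned.columns))
```

This matters because positional cat_features indices refer to whatever column happens to sit at that position, so a reordered frame can place a float column where a categorical one was expected.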

Upvotes: 1

Akavall

Reputation: 86356

You are trying to use a column with dtype float as a categorical column. To fix the error, convert it to an int:

train["a"] = train["a"].astype(int)

However, in your case 118975.5 doesn't look like a valid category, so you might want to double-check whether you want to use that column as categorical.

Here is a small example that reproduces the error and the fix:

from catboost import CatBoostRegressor
import numpy as np
import pandas as pd

train_data = [[1, 4],
              [4.0, 5]]

train = pd.DataFrame(train_data, columns=["a", "b"])

# train["a"] = train["a"].astype(int)  # This line fixes the "Invalid type for cat_feature" issue

train_labels = [10, 20]
model = CatBoostRegressor(iterations=2,
                          cat_features=["a"]
                          )
model.fit(train, train_labels)
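
A related check, sketched under an assumption about the question's data: the expression `x.dtypes != float` only excludes float64 columns, so a column stored as float32 or float16 would still be selected as categorical. Whether the asker's frame actually contains such columns is not confirmed by the post; the toy frame below is hypothetical. `pandas.api.types.is_float_dtype` excludes every floating-point dtype.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype

# Hypothetical toy frame; the float32 column mimics a downcast numeric feature
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans"],
    "CNT_CHILDREN": np.array([0, 1], dtype="int64"),
    "AMT_INCOME_TOTAL": np.array([202500.0, 118975.5], dtype="float32"),
    "AMT_CREDIT": np.array([406597.5, 135000.0], dtype="float64"),
})

# Idiom from the question: float32 compares unequal to float (float64)
naive = np.where(toy.dtypes != float)[0]

# Sketch of an alternative: skip every floating-point dtype
robust = [i for i, dtype in enumerate(toy.dtypes) if not is_float_dtype(dtype)]

print(naive.tolist())
print(robust)
```

In this toy frame the float32 column is picked up by the first expression but not by the second. Integer columns are still selected by both, so it is worth reviewing whether each of them is really meant to be categorical.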

Upvotes: 7

Alex Ramses

Reputation: 577

This isn't exactly a solution, but I figure that 'cat_feature[534,6]=118975.5' tells you that there is a problem in the 7th column.
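
A small sketch of how those indices could be pulled out of the message, assuming the pair is [row, zero-based column], which matches the "7th column" reading above but is not confirmed by the post. The helper `parse_cat_feature_error` is hypothetical.

```python
import re

def parse_cat_feature_error(message):
    """Extract (first_index, second_index, value) from the error text, or None."""
    match = re.search(r"cat_feature\[(\d+),(\d+)\]=([^\s:]+)", message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), match.group(3)

message = ("Invalid type for cat_feature[534,6]=118975.5 : cat_features must be "
           "integer or string, real number values and NaN values should be "
           "converted to string.")
row, col, value = parse_cat_feature_error(message)
print(row, col, value)

# With the question's frame one could then look at X_train.columns[col]
# and X_train.iloc[row, col] (assuming the [row, column] ordering holds).
```

Checking the dtype of that one column is usually quicker than scanning all 122 columns.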

I'm facing a similar problem now.

Upvotes: 2
