OverFitter

Reputation: 49

TypeError with VotingClassifier

I want to use VotingClassifier, but I'm having problems cross-validating it.

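    from catboost import CatBoostClassifier
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.model_selection import cross_validate, train_test_split

    x_train, x_validation, y_train, y_validation = train_test_split(x, y, test_size=.22, random_state=2)
    x_train = x_train.fillna(0)
    clf1 = CatBoostClassifier()
    clf2 = RandomForestClassifier()
    clf = VotingClassifier(estimators=[('cb', clf1), ('rf', clf2)])
    clf.fit(x_train.values, y_train)  # .values is an attribute, not a method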

I get an error when predicting during cross-validation:

    cross_validate(clf, x_train, y_train, scoring='accuracy', return_train_score=True, n_jobs=4)

TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'

(full error here)


You can download x_train and y_train here:

x_train
y_train

Upvotes: 0

Views: 1390

Answers (1)

Vivek Kumar

Reputation: 36599

This error comes from this line in VotingClassifier's predict():

    np.bincount(x, weights=self._weights_not_none)

Here x holds the predictions returned by the individual classifiers inside the VotingClassifier.
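To make the voting step concrete, here is a minimal sketch of hard voting built on np.bincount. It is not scikit-learn's actual source: the helper name hard_vote and the sample predictions are made up for illustration, and it assumes integer class labels with one column per classifier.

```python
import numpy as np

def hard_vote(predictions, weights=None):
    """Pick the most voted class per row (hypothetical helper, for illustration).

    predictions: integer array of shape (n_samples, n_classifiers).
    weights: optional per-classifier weights, one per column.
    """
    return np.apply_along_axis(
        lambda votes: np.argmax(np.bincount(votes, weights=weights)),
        axis=1,
        arr=predictions,
    )

# Three samples, three classifiers, made-up integer predictions
predictions = np.array([[0, 1, 1],
                        [2, 2, 0],
                        [1, 0, 0]])

majority = hard_vote(predictions)                     # unweighted vote
weighted = hard_vote(predictions, weights=[3, 1, 1])  # first classifier counts triple
```

With equal weights each row resolves to its most frequent label; with the weights above the first classifier can outvote the other two.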

According to the documentation of np.bincount:

Count number of occurrences of each value in array of non-negative ints.

x : array_like, 1 dimension, nonnegative ints

So this function accepts only integer values in the array.
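The restriction is easy to reproduce in isolation. The snippet below is a small sketch with made-up vote arrays (int_votes and float_votes are illustrative names, not taken from the question's data): the same labels are counted fine as int64, rejected as float64, and accepted again after an explicit cast.

```python
import numpy as np

int_votes = np.array([0, 1, 1], dtype=np.int64)
float_votes = int_votes.astype(np.float64)  # same labels, stored as floats

# Integer input works: one vote for class 0, two for class 1
counts = np.bincount(int_votes)

# Float input is rejected because NumPy refuses the unsafe float64 -> int64 cast
try:
    np.bincount(float_votes)
    float_error = None
except TypeError as exc:
    float_error = exc

# An explicit cast back to int64 makes the call succeed again
cast_counts = np.bincount(float_votes.astype(np.int64))
```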

Your code will work if you replace the CatBoostClassifier with any scikit-learn classifier. Scikit-learn classifiers return predictions in the dtype of the labels they were trained on, and VotingClassifier trains its estimators on integer-encoded labels, so their predict() yields np.int64 arrays.

CatBoostClassifier, however, returns np.float64 from predict(), hence the error. It should really return int64 too, because predict() should return class labels, not float values. I don't know why it returns floats.

You can correct this by extending the CatBoostClassifier class and converting the predictions on the fly.

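    import numpy as np
    from catboost import CatBoostClassifier
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.model_selection import cross_validate

    class CatBoostClassifierInt(CatBoostClassifier):
        def predict(self, data, prediction_type='Class', ntree_start=0, ntree_end=0, thread_count=1, verbose=None):
            # Delegate to the public predict() instead of the private _predict(),
            # whose signature is not guaranteed to stay stable across CatBoost versions.
            predictions = super().predict(data, prediction_type=prediction_type, ntree_start=ntree_start,
                                          ntree_end=ntree_end, thread_count=thread_count, verbose=verbose)

            # The only change: cast the float class labels to a flat int64 array
            return np.asarray(predictions, dtype=np.int64).ravel()

    # x_train and y_train are the data from the question
    clf1 = CatBoostClassifierInt()
    clf2 = RandomForestClassifier()
    clf = VotingClassifier(estimators=[('cb', clf1), ('rf', clf2)])
    cross_validate(clf, x_train, y_train, scoring='accuracy', return_train_score=True)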

Now you won't get that error.

A more correct version is the following. It handles all types of labels, returns predictions in the same form as the input labels, and can be used with scikit-learn with ease:

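    import numpy as np
    from catboost import CatBoostClassifier
    from sklearn.preprocessing import LabelEncoder

    class CatBoostClassifierCorrected(CatBoostClassifier):
        def fit(self, X, y=None, cat_features=None, sample_weight=None, baseline=None, use_best_model=None,
                eval_set=None, verbose=None, logging_level=None, plot=False, column_description=None, verbose_eval=None):
            # Encode the labels as integers 0..n_classes-1 before training
            self.le_ = LabelEncoder().fit(y)
            transformed_y = self.le_.transform(y)

            # Call the public fit() with keyword arguments rather than the private _fit()
            super().fit(X, transformed_y, cat_features=cat_features, sample_weight=sample_weight, baseline=baseline,
                        use_best_model=use_best_model, eval_set=eval_set, verbose=verbose, logging_level=logging_level,
                        plot=plot, column_description=column_description, verbose_eval=verbose_eval)
            return self

        def predict(self, data, prediction_type='Class', ntree_start=0, ntree_end=0, thread_count=1, verbose=None):
            predictions = super().predict(data, prediction_type=prediction_type, ntree_start=ntree_start,
                                          ntree_end=ntree_end, thread_count=thread_count, verbose=verbose)

            # Cast the predictions to int64 and map them back to the original labels
            return self.le_.inverse_transform(np.asarray(predictions, dtype=np.int64).ravel())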

This will handle all the different types of labels.
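The encode/decode round trip behind this idea can be illustrated with plain NumPy. This is only a sketch: np.unique stands in for LabelEncoder (an assumption that holds for simple label arrays like this one), and the string labels are made up for illustration.

```python
import numpy as np

# Made-up string labels standing in for y
y = np.array(["cat", "dog", "cat", "bird"])

# Encode: sorted unique classes plus an integer code per sample
classes, encoded = np.unique(y, return_inverse=True)

# Simulate a classifier that hands the integer codes back as floats
float_predictions = encoded.astype(np.float64)

# Decode: cast to int64, then index into the classes to recover the labels
decoded = classes[float_predictions.astype(np.int64)]
```

Because the classes are sorted, "bird" maps to 0, "cat" to 1 and "dog" to 2, and decoding returns the original strings.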

Upvotes: 1
