ronswamson

Reputation: 33

sklearn AssertionError: not equal to tolerance on custom estimator

I am creating a custom classifier using the scikit-learn interface, just for learning purposes. I've come up with the following code:

import numpy as np
from sklearn.utils.estimator_checks import check_estimator
from sklearn.base import BaseEstimator, ClassifierMixin, check_X_y
from sklearn.utils.validation import check_array, check_is_fitted, check_random_state

class TemplateEstimator(BaseEstimator, ClassifierMixin):
  def __init__(self, threshold=0.5, random_state=None):
    self.threshold = threshold
    self.random_state = random_state

  def fit(self, X, y):
    self.random_state_ = check_random_state(self.random_state)
    X, y = check_X_y(X, y)
    self.classes_ = np.unique(y)
    self.fitted_ = True
    return self
  
  def predict(self, X):
    check_is_fitted(self)
    X = check_array(X)

    y_hat = self.random_state_.choice(self.classes_, size=X.shape[0])
    return y_hat

check_estimator(TemplateEstimator())

This classifier simply makes random guesses. I tried my best to follow the scikit-learn documentation and guidelines for developing my own estimator. However, I get the following error:

AssertionError: 
Arrays are not equal
Classifier can't predict when only one class is present.
Mismatched elements: 10 / 10 (100%)
Max absolute difference: 1.
Max relative difference: 1.
 x: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
 y: array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

I can't be sure, but I guess the randomness (i.e. self.random_state_) is causing the error. I am using sklearn version 1.0.2.

Upvotes: 3

Views: 717

Answers (1)

adrin

Reputation: 4896

The first thing to note is that you get much more readable output if you use parametrize_with_checks with pytest instead of check_estimator. It would look like this:

from sklearn.utils.estimator_checks import parametrize_with_checks

@parametrize_with_checks([TemplateEstimator()])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)

And if you run that with pytest, you'll get an output with the following failed tests:

FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_pipeline_consistency] - AssertionError: 
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_classifiers_train] - AssertionError
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_classifiers_train(readonly_memmap=True)] - AssertionError
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_classifiers_train(readonly_memmap=True,X_dtype=float32)] - AssertionError
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_classifiers_regression_target] - AssertionError: Did not raise: [<class 'ValueErr...
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_methods_sample_order_invariance] - AssertionError: 
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_methods_subset_invariance] - AssertionError: 

Some of those tests check for output consistency, which is not relevant in your case since you return random values; for those you need to set the non_deterministic estimator tag. Other tests, such as check_classifiers_regression_target, check that you do the right input validation and raise the right error, which you don't. You either need to fix that (see the sketch below) or add the no_validation tag. Finally, check_classifiers_train checks that your model gives a reasonable score on a given problem; since you're returning random values, that condition is not met, and you can set the poor_score estimator tag to skip it.
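
If you would rather do the validation than opt out with the no_validation tag, a minimal sketch (keeping the rest of the estimator exactly as in your question) is to validate the target in fit with check_classification_targets, which raises a ValueError for continuous targets:

from sklearn.utils.multiclass import check_classification_targets

class TemplateEstimator(BaseEstimator, ClassifierMixin):
    ...
    def fit(self, X, y):
        self.random_state_ = check_random_state(self.random_state)
        X, y = check_X_y(X, y)
        # Raises ValueError on continuous (regression-style) targets,
        # which is what check_classifiers_regression_target expects.
        check_classification_targets(y)
        self.classes_ = np.unique(y)
        self.fitted_ = True
        return self

For a toy estimator like this one, though, the tags below are the simpler route.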

You can add these tags by adding this to your estimator:

class TemplateEstimator(BaseEstimator, ClassifierMixin):
    ...
    def _more_tags(self):
        return {
            "non_deterministic": True,
            "no_validation": True,
            "poor_score": True,
        }
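
These tags get merged with scikit-learn's defaults through BaseEstimator._get_tags(), so you can quickly confirm they are picked up. Note that _get_tags is a private helper and may change between versions; the snippet below is only a sanity check that works on 1.0.x:

tags = TemplateEstimator()._get_tags()
# Values from _more_tags override the defaults inherited from BaseEstimator.
print(tags["non_deterministic"], tags["no_validation"], tags["poor_score"])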

But even then, two tests would fail if you use the main branch of scikit-learn or the nightly builds. I believe this needs a fix and I've opened an issue for it (EDIT: the fix has now been merged upstream and will be available in the next release). You can avoid these failures by marking those tests as expected to fail in your tags. In the end, your estimator would look like this:

import numpy as np
from sklearn.utils.estimator_checks import parametrize_with_checks
from sklearn.base import BaseEstimator, ClassifierMixin, check_X_y
from sklearn.utils.validation import check_array, check_is_fitted, check_random_state


class TemplateEstimator(BaseEstimator, ClassifierMixin):
    def __init__(self, threshold=0.5, random_state=None):
        self.threshold = threshold
        self.random_state = random_state

    def fit(self, X, y):
        self.random_state_ = check_random_state(self.random_state)
        X, y = check_X_y(X, y)
        self.classes_ = np.unique(y)
        self.fitted_ = True
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)

        y_hat = self.random_state_.choice(self.classes_, size=X.shape[0])
        return y_hat

    def _more_tags(self):
        return {
            "non_deterministic": True,
            "no_validation": True,
            "poor_score": True,
            "_xfail_checks": {
                "check_methods_sample_order_invariance": "This test shouldn't be running at all!",
                "check_methods_subset_invariance": "This test shouldn't be running at all!",
            },
        }


@parametrize_with_checks([TemplateEstimator()])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)
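
As a quick sanity check outside the test suite, you can also fit and score the estimator on some made-up data (the toy arrays below are only for illustration and reuse the imports and class definition above). Since ClassifierMixin provides score (accuracy), random guessing over two balanced classes should land around 0.5:

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = rng.randint(0, 2, size=100)

clf = TemplateEstimator(random_state=0).fit(X, y)
print(clf.predict(X[:10]))  # ten random draws from clf.classes_
print(clf.score(X, y))      # accuracy; roughly 0.5 for random guesses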

Upvotes: 1
