Reputation: 33
I am creating a custom classifier using the scikit-learn interface, just for learning purposes. So I've come up with the following code:
import numpy as np
from sklearn.utils.estimator_checks import check_estimator
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import (
    check_array,
    check_is_fitted,
    check_random_state,
    check_X_y,
)


class TemplateEstimator(BaseEstimator, ClassifierMixin):
    def __init__(self, threshold=0.5, random_state=None):
        self.threshold = threshold
        self.random_state = random_state

    def fit(self, X, y):
        self.random_state_ = check_random_state(self.random_state)
        X, y = check_X_y(X, y)
        self.classes_ = np.unique(y)
        self.fitted_ = True
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        # Ignore X entirely and pick a class uniformly at random per sample.
        y_hat = self.random_state_.choice(self.classes_, size=X.shape[0])
        return y_hat


check_estimator(TemplateEstimator())
This classifier simply makes random guesses. I tried my best to follow the scikit-learn documentation and guidelines for developing my own estimator. However, I get the following error:
AssertionError:
Arrays are not equal
Classifier can't predict when only one class is present.
Mismatched elements: 10 / 10 (100%)
Max absolute difference: 1.
Max relative difference: 1.
x: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
y: array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
I can't be sure, but I guess the randomness (i.e. self.random_state_) is causing the error. I am using sklearn version 1.0.2.
Upvotes: 3
Views: 717
Reputation: 4896
The first thing to note is that you can get much better output if you use parametrize_with_checks with pytest instead of check_estimator. It would look like:
from sklearn.utils.estimator_checks import parametrize_with_checks


@parametrize_with_checks([TemplateEstimator()])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)
If you run that with pytest, you'll get output listing the following failed tests:
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_pipeline_consistency] - AssertionError:
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_classifiers_train] - AssertionError
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_classifiers_train(readonly_memmap=True)] - AssertionError
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_classifiers_train(readonly_memmap=True,X_dtype=float32)] - AssertionError
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_classifiers_regression_target] - AssertionError: Did not raise: [<class 'ValueErr...
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_methods_sample_order_invariance] - AssertionError:
FAILED ../../../../tmp/1.py::test_sklearn_compatible_estimator[TemplateEstimator()-check_methods_subset_invariance] - AssertionError:
Some of those tests check for output consistency, which is not relevant in your case since you return random values. In this case, you need to set the non_deterministic estimator tag. Some other tests, such as check_classifiers_regression_target, check whether you do the right validations and raise the right errors, which you don't. So you either need to fix that (see the sketch further below), or add the no_validation tag. Another issue is that check_classifiers_train checks whether your model gives reasonable output for a given problem. But since you're returning random values, those conditions are not met. You can set the poor_score estimator tag to skip that.
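As a side note, if you'd rather fix the validation than skip it, scikit-learn ships a helper that raises exactly the error this check looks for; a minimal sketch of fit with one extra line (everything else unchanged):
from sklearn.utils.multiclass import check_classification_targets

class TemplateEstimator(BaseEstimator, ClassifierMixin):
    ...

    def fit(self, X, y):
        self.random_state_ = check_random_state(self.random_state)
        X, y = check_X_y(X, y)
        # Raises ValueError("Unknown label type: ...") for continuous
        # (regression) targets, which is exactly what
        # check_classifiers_regression_target expects from a classifier.
        check_classification_targets(y)
        self.classes_ = np.unique(y)
        self.fitted_ = True
        return self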
You can add these tags by adding this to your estimator:
class TemplateEstimator(BaseEstimator, ClassifierMixin):
    ...

    def _more_tags(self):
        return {
            "non_deterministic": True,
            "no_validation": True,
            "poor_score": True,
        }
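You can quickly verify that the tags are picked up; in sklearn 1.0.x the merged tags are exposed through the (private) _get_tags method:
tags = TemplateEstimator()._get_tags()
print(tags["non_deterministic"], tags["no_validation"], tags["poor_score"])
# True True True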
But even then, two tests would fail if you use the main branch of scikit-learn or the nightly builds. I believe this needs a fix and I've opened an issue for it (EDIT: the fix is now merged upstream and will be available in the next release). You can avoid these failures by marking those tests as expected to fail in your tags. In the end, your estimator would look like:
import numpy as np
from sklearn.utils.estimator_checks import parametrize_with_checks
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import (
    check_array,
    check_is_fitted,
    check_random_state,
    check_X_y,
)


class TemplateEstimator(BaseEstimator, ClassifierMixin):
    def __init__(self, threshold=0.5, random_state=None):
        self.threshold = threshold
        self.random_state = random_state

    def fit(self, X, y):
        self.random_state_ = check_random_state(self.random_state)
        X, y = check_X_y(X, y)
        self.classes_ = np.unique(y)
        self.fitted_ = True
        return self

    def predict(self, X):
        check_is_fitted(self)
        X = check_array(X)
        y_hat = self.random_state_.choice(self.classes_, size=X.shape[0])
        return y_hat

    def _more_tags(self):
        return {
            "non_deterministic": True,
            "no_validation": True,
            "poor_score": True,
            "_xfail_checks": {
                "check_methods_sample_order_invariance": "This test shouldn't be running at all!",
                "check_methods_subset_invariance": "This test shouldn't be running at all!",
            },
        }


@parametrize_with_checks([TemplateEstimator()])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)
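With the tags in place, running this file with pytest (e.g. pytest test_template.py, assuming that file name) should report each check as passed or expectedly failed (xfail).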
Upvotes: 1