Reputation: 5774
Using cross validation (CV) with sklearn is quite easy and straightforward. But the default implementation when setting cv=5 in a linear CV model, like ElasticNetCV or LassoCV, is a KFold CV. For various reasons I'd like to use a StratifiedKFold. From the documentation, it seems like any CV method can be given with cv=.
Passing cv=KFold(5) works as expected, but cv=StratifiedKFold(5) raises the error:
ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
I know that I can use cross_val_score after fitting, but I'd like to pass StratifiedKFold as CV directly to the linear model.
My minimal working example is:
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + np.random.rand(100)
# KFold default implementation:
model_default = ElasticNetCV(cv=5)
model_default.fit(x, y) # works fine
# KFold given as cv explicitly:
model_kfexp = ElasticNetCV(cv=KFold(5))
model_kfexp.fit(x, y) # also works fine
# StratifiedKFold given as cv explicitly:
model_skf = ElasticNetCV(cv=StratifiedKFold(5))
model_skf.fit(x, y) # THIS RAISES THE ERROR
Any idea how I can set StratifiedKFold as CV directly?
Upvotes: 2
Views: 5302
Reputation: 25199
The root of your problem is this line:
y = np.arange(100) + np.random.rand(100)
StratifiedKFold cannot stratify on a continuous target: it needs discrete class labels to balance across the folds, hence your error. Try changing this line to a categorical target and your code will execute happily:
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.random.choice([0,1], size=100)
# KFold default implementation:
model_default = ElasticNetCV(cv=5)
model_default.fit(x, y) # works fine
# KFold given as cv explicitly:
model_kfexp = ElasticNetCV(cv=KFold(5))
model_kfexp.fit(x, y) # also works fine
# StratifiedKFold given as cv explicitly:
model_skf = ElasticNetCV(cv=StratifiedKFold(5))
model_skf.fit(x, y) # no ERROR
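To see why the two targets behave differently, here is a minimal sketch using sklearn.utils.multiclass.type_of_target, which reports the same target-type names that appear in the error message. That StratifiedKFold relies on this exact helper internally is my assumption; the seed and variable names are only for illustration.

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Regression-style target, as in the question
y_continuous = np.arange(100) + rng.random(100)
# Class-label target, as in the code above
y_binary = rng.integers(0, 2, size=100)

kind_continuous = type_of_target(y_continuous)
kind_binary = type_of_target(y_binary)
print(kind_continuous, kind_binary)
```

The first target is reported as 'continuous' and the second as 'binary', which matches the types named in the ValueError.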
NOTE
If your target is continuous, use KFold. If your target is categorical, you may use either KFold or StratifiedKFold, whichever suits your needs.
NOTE 2
If you insist on emulating stratified sampling on continuous data, you may wish to apply pandas.cut to your target to bin it, then do stratified sampling on the binned labels, and finally pass the resulting (train_id, test_id) generator to the cv param:
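import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import StratifiedKFold
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + np.random.rand(100)
# Bin the continuous target into 10 equal-width intervals to stratify on
y_cat = pd.cut(y, 10, labels=range(10))
# Split on the binned labels, but fit the model on the original target
skf_gen = StratifiedKFold(5).split(x, y_cat)
model_skf = ElasticNetCV(cv=skf_gen)
model_skf.fit(x, y) # no ERROR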
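If you want to check that the binned split really is stratified, a sketch along these lines may help. It is a variation on the code above, not a different method: labels=False (so pd.cut returns plain integer bin codes), the fixed seed, and the names y_bins, splits and bins_per_fold are my own choices. The splits are materialised as a list so they can be inspected before being handed to cv.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + rng.random(100)

# labels=False makes pd.cut return integer bin codes (0..9) instead of intervals
y_bins = pd.cut(y, 10, labels=False)

# Materialise the (train_idx, test_idx) pairs so they can be inspected and reused
splits = list(StratifiedKFold(n_splits=5).split(x, y_bins))

# Number of distinct bins represented in each test fold
bins_per_fold = [len(np.unique(y_bins[test_idx])) for _, test_idx in splits]
# Every sample should land in exactly one test fold
all_test_idx = np.sort(np.concatenate([test_idx for _, test_idx in splits]))
print(bins_per_fold)

# The model is still fitted on the original continuous target
model = ElasticNetCV(cv=splits)
model.fit(x, y)
```

Each test fold should contain samples from all 10 bins, which is the property a plain KFold without shuffling would not give you on this sorted target.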
Upvotes: 4