JE_Muc

Reputation: 5774

sklearn stratified k-fold CV with linear model like ElasticNetCV

Using cross validation (CV) with sklearn is easy and straightforward. However, setting cv=5 in a linear CV model such as ElasticNetCV or LassoCV uses KFold by default. For various reasons I'd like to use StratifiedKFold instead. According to the documentation, it seems that any CV splitter can be passed via cv=.

Passing cv=KFold(5) works as expected, but cv=StratifiedKFold(5) raises the following error:

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.

I know that I can use cross_val_score after fitting, but I'd like to pass StratifiedKFold as CV directly to the linear model.

My minimum working example is:

from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np

x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + np.random.rand(100)

# KFold default implementation:
model_default = ElasticNetCV(cv=5)
model_default.fit(x, y)  # works fine
# KFold given as cv explicitly:
model_kfexp = ElasticNetCV(cv=KFold(5))
model_kfexp.fit(x, y)  # also works fine

# StratifiedKFold given as cv explicitly:
model_skf = ElasticNetCV(cv=StratifiedKFold(5))
model_skf.fit(x, y)  # THIS RAISES THE ERROR

Any idea how I can set StratifiedKFold as CV directly?

Upvotes: 2

Views: 5302

Answers (1)

Sergey Bushmanov

Reputation: 25199

The root of your problem is this line:

y = np.arange(100) + np.random.rand(100)

StratifiedKFold stratifies on class labels, so it cannot split on a continuous target; hence your error. Replace this line with a categorical target and your code will execute happily:

from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np

x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.random.choice([0,1], size=100)

# KFold default implementation:
model_default = ElasticNetCV(cv=5)
model_default.fit(x, y)  # works fine
# KFold given as cv explicitly:
model_kfexp = ElasticNetCV(cv=KFold(5))
model_kfexp.fit(x, y)  # also works fine

# StratifiedKFold given as cv explicitly:
model_skf = ElasticNetCV(cv=StratifiedKFold(5))
model_skf.fit(x, y)  # no ERROR

NOTE

If your target is continuous, use KFold. If your target is categorical, you can use either KFold or StratifiedKFold, whichever suits your needs.
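If you are unsure which case applies, you can inspect the target yourself. As far as I know, sklearn.utils.multiclass.type_of_target is the helper behind the target types named in the error message, so treat this as a quick diagnostic sketch rather than a description of the internals. The variable names are made up for illustration:

```python
import numpy as np
from sklearn.utils.multiclass import type_of_target

rng = np.random.default_rng(0)

# Float target with non-integer values, like the one in the question.
y_continuous = np.arange(100) + rng.random(100)
# Integer target with exactly two distinct values.
y_binary = np.arange(100) % 2

target_types = {
    "y_continuous": type_of_target(y_continuous),
    "y_binary": type_of_target(y_binary),
}
print(target_types)
```

Only targets reported as 'binary' or 'multiclass' are accepted by StratifiedKFold, which matches the message in the traceback.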

NOTE 2

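If you insist on emulating stratified sampling on continuous data, you can bin the target with pandas.cut, stratify on the bins, and pass the resulting (train_id, test_id) pairs to the cv param. Materialize the splits as a list: a bare generator is exhausted after the first fit.

import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import StratifiedKFold

x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + np.random.rand(100)

# Bin the continuous target into 10 equal-width bins; labels=False returns plain integer bin codes 0..9
y_cat = pd.cut(y, 10, labels=False)
# Stratify on the bins and store the (train_id, test_id) pairs in a list so they can be reused
skf_splits = list(StratifiedKFold(5).split(x, y_cat))

model_skf = ElasticNetCV(cv=skf_splits)
model_skf.fit(x, y)  # no ERROR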
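A variation on the same idea, as a rough sketch: use quantile bins via pandas.qcut, so every bin holds roughly the same number of samples, and keep the splits in a list so the model can be refit. The helper name stratified_regression_folds and its parameters are hypothetical, chosen only for this example:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import StratifiedKFold


def stratified_regression_folds(X, y, n_splits=5, n_bins=10, seed=0):
    """Hypothetical helper: stratify a continuous target on its quantile bins."""
    # labels=False returns integer bin codes; duplicates="drop" merges tied bin edges.
    y_binned = pd.qcut(y, q=n_bins, labels=False, duplicates="drop")
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Return a list of (train_idx, test_idx) pairs so the splits are reusable.
    return list(skf.split(X, y_binned))


rng = np.random.default_rng(0)
x = np.arange(100, dtype=np.float64).reshape(-1, 1)
y = np.arange(100) + rng.random(100)

folds = stratified_regression_folds(x, y)

model = ElasticNetCV(cv=folds)
model.fit(x, y)
model.fit(x, y)  # refitting works because folds is a list, not a generator
```

Each bin needs at least n_splits members for this to work, so keep n_bins small relative to the sample size.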

Upvotes: 4
