Reputation: 518
So elastic net is supposed to be a hybrid between ridge regression (L2 regularization) and lasso (L1 regularization). However, it seems that even if l1_ratio is 0 I don't get the same result as ridge. I know ridge uses gradient descent and elastic net uses coordinate descent, but the optima should be the same, no? Moreover, I've found that ElasticNet often throws ConvergenceWarnings for no obvious reason, while lasso and ridge don't. Here's a snippet:
from sklearn.datasets import load_boston
from sklearn.utils import shuffle
from sklearn.linear_model import ElasticNet, Ridge, Lasso
from sklearn.model_selection import train_test_split

data = load_boston()
X, y = shuffle(data.data, data.target, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)

alpha = 1  # same regularization strength for all three models

en = ElasticNet(alpha=alpha, l1_ratio=0)  # l1_ratio=0 should be pure L2, i.e. ridge
en.fit(X_train, y_train)
print('en train score: ', en.score(X_train, y_train))

rr = Ridge(alpha=alpha)
rr.fit(X_train, y_train)
print('rr train score: ', rr.score(X_train, y_train))

lr = Lasso(alpha=alpha)
lr.fit(X_train, y_train)
print('lr train score: ', lr.score(X_train, y_train))

print('---')
print('en test score: ', en.score(X_test, y_test))
print('rr test score: ', rr.score(X_test, y_test))
print('lr test score: ', lr.score(X_test, y_test))
print('---')
print('en coef: ', en.coef_)
print('rr coef: ', rr.coef_)
print('lr coef: ', lr.coef_)
Even though l1_ratio is 0, the train and test scores of elastic net are close to the lasso scores (and not to ridge, as you would expect). Moreover, elastic net throws a ConvergenceWarning even if I increase max_iter (even up to 1000000 there seems to be no effect) or tol (0.1 still triggers the warning, but 0.2 doesn't). Increasing alpha (as the warning suggests) also has no effect.
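To double-check that it really is only ElasticNet that warns, I capture the warnings explicitly (a minimal sketch reusing alpha, X_train, and y_train from the snippet above):

import warnings
from sklearn.exceptions import ConvergenceWarning

# record every ConvergenceWarning raised during the three fits
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always', category=ConvergenceWarning)
    ElasticNet(alpha=alpha, l1_ratio=0).fit(X_train, y_train)
    Ridge(alpha=alpha).fit(X_train, y_train)
    Lasso(alpha=alpha).fit(X_train, y_train)

print([str(w.message) for w in caught])  # in my runs, only the ElasticNet fit shows up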
Upvotes: 4
Views: 2584
Reputation: 833
Based on the answer by @sascha, one can match the results between the two models:
import sklearn
print(sklearn.__version__)

from sklearn.linear_model import Ridge, ElasticNet
from sklearn.datasets import load_boston

dataset = load_boston()
X = dataset.data
y = dataset.target

# plain ridge regression
f = Ridge(alpha=1,
          fit_intercept=True, normalize=False,
          copy_X=True, max_iter=1000, tol=1e-4, random_state=42,
          solver='auto')

# elastic net with alpha rescaled by n_samples and l1_ratio effectively 0
g = ElasticNet(alpha=1/X.shape[0], l1_ratio=1e-16,
               fit_intercept=True, normalize=False,
               copy_X=True, max_iter=1000, tol=1e-4, random_state=42,
               precompute=False, warm_start=False,
               positive=False, selection='cyclic')

f.fit(X, y)
g.fit(X, y)

# relative difference between the two coefficient vectors
print(abs(f.coef_ - g.coef_) / abs(f.coef_))
Output:
0.19.2
[1.19195623e-14 1.17076625e-15 3.25973465e-13 1.61694280e-14
4.77274767e-15 4.15332538e-15 6.15640568e-14 1.61772832e-15
4.56125088e-14 5.44320605e-14 8.99189018e-15 2.31213025e-15
3.74181954e-15]
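The alpha=1/X.shape[0] conversion follows from the two objectives quoted in @sascha's answer: multiplying the ElasticNet objective by 2 * n_samples recovers the Ridge objective. As a sketch of that mapping (ridge_alpha_to_enet_alpha is just an illustrative helper name, not part of sklearn):

def ridge_alpha_to_enet_alpha(ridge_alpha, n_samples):
    # Ridge:                    ||y - Xw||^2 + ridge_alpha * ||w||^2
    # ElasticNet, l1_ratio -> 0: ||y - Xw||^2 / (2*n) + 0.5 * enet_alpha * ||w||^2
    # Scaling the ElasticNet objective by 2*n gives
    #   ||y - Xw||^2 + n * enet_alpha * ||w||^2,
    # so the penalties match when enet_alpha = ridge_alpha / n.
    return ridge_alpha / n_samples

print(ridge_alpha_to_enet_alpha(1, X.shape[0]))  # 1/506 for the Boston data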
Upvotes: 4
Reputation: 33532
Just read the docs. Then you will find out that none of these is using gradient descent and, more importantly, that they optimize different objectives. Ridge minimizes

$\min_w \; \|Xw - y\|_2^2 + a\,\|w\|_2^2$

while ElasticNet minimizes

$\min_w \; \frac{1}{2\,n_{\mathrm{samples}}} \|Xw - y\|_2^2 + a\,p\,\|w\|_1 + \frac{a\,(1 - p)}{2}\,\|w\|_2^2$

which shows, when substituting a=1, p=0, that:

- ElasticNet scales the least-squares loss by 1/(2 * n_samples), and
- ElasticNet has an extra 1/2 factor in the l2-term.

Why different models? Probably because sklearn follows the canonical/original R-based implementation glmnet.
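You can verify that scaling numerically. A quick sketch (my own check, not from the docs: evaluate both objectives at the same fixed coefficient vector and compare them up to the constant factor 2 * n_samples):

import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import Ridge

X, y = load_boston(return_X_y=True)
n = X.shape[0]
w = Ridge(alpha=1, fit_intercept=False).fit(X, y).coef_  # any fixed w would do

ridge_obj = np.sum((y - X @ w) ** 2) + 1.0 * np.sum(w ** 2)                      # a=1
enet_obj = np.sum((y - X @ w) ** 2) / (2 * n) + 0.5 * (1 / n) * np.sum(w ** 2)   # a=1/n, p=0
print(np.isclose(ridge_obj, 2 * n * enet_obj))  # True: same objective up to a constant factor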
Furthermore, I would not be surprised to see numerical issues when doing mixed-norm optimization while forcing a non-mixed norm like l1_ratio=0, especially when there are specialized solvers for each of the two non-mixed optimization problems.
Luckily, sklearn even has something to say about it:
Currently, l1_ratio <= 0.01 is not reliable, unless you supply your own sequence of alpha.
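If you really need a near-zero l1_ratio, the escape hatch in that sentence is to pass the alpha sequence yourself, e.g. via ElasticNetCV's alphas parameter. Something like this (a sketch; the grid values here are arbitrary):

import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import ElasticNetCV

X, y = load_boston(return_X_y=True)
# hand ElasticNetCV an explicit alpha grid instead of letting it
# derive one from the (near-zero) l1_ratio
enet_cv = ElasticNetCV(l1_ratio=0.01, alphas=np.logspace(-4, 1, 50), cv=5)
enet_cv.fit(X, y)
print(enet_cv.alpha_)  # the alpha picked by cross-validation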
Upvotes: 4