wouterdobbels

Reputation: 518

scikit learn: elastic net approaching ridge

So elastic net is supposed to be a hybrid between ridge regression (L2 regularization) and lasso (L1 regularization). However, it seems that even if l1_ratio is 0 I don't get the same result as ridge. I know ridge uses gradient descent and elastic net uses coordinate descent, but the optima should be the same, no? Moreover, I've found that ElasticNet often throws ConvergenceWarnings for no obvious reason, while Lasso and Ridge don't. Here's a snippet:

from sklearn.datasets import load_boston
from sklearn.utils import shuffle
from sklearn.linear_model import ElasticNet, Ridge, Lasso
from sklearn.model_selection import train_test_split

data = load_boston()
X, y = shuffle(data.data, data.target, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=43)
alpha = 1

en = ElasticNet(alpha=alpha, l1_ratio=0)
en.fit(X_train, y_train)
print('en train score: ', en.score(X_train, y_train))

rr = Ridge(alpha=alpha)
rr.fit(X_train, y_train)
print('rr train score: ', rr.score(X_train, y_train))

lr = Lasso(alpha=alpha)
lr.fit(X_train, y_train)
print('lr train score: ', lr.score(X_train, y_train))
print('---')
print('en test score: ', en.score(X_test, y_test))
print('rr test score: ', rr.score(X_test, y_test))
print('lr test score: ', lr.score(X_test, y_test))
print('---')
print('en coef: ', en.coef_)
print('rr coef: ', rr.coef_)
print('lr coef: ', lr.coef_)

Even though l1_ratio is 0, the train and test scores of elastic net are close to the lasso scores (and not to ridge, as you would expect). Moreover, elastic net throws a ConvergenceWarning even if I increase max_iter (even up to 1000000 there seems to be no effect) or tol (0.1 still warns, but 0.2 doesn't). Increasing alpha (as the warning suggests) also has no effect.

Upvotes: 4

Views: 2584

Answers (2)

Lei

Reputation: 833

Based on the answer by @sascha (below), one can make the two models produce matching results:

import sklearn
print(sklearn.__version__)

from sklearn.linear_model import Ridge, ElasticNet
from sklearn.datasets import load_boston

dataset = load_boston()
X = dataset.data
y = dataset.target

f = Ridge(alpha=1, 
          fit_intercept=True, normalize=False, 
          copy_X=True, max_iter=1000, tol=1e-4, random_state=42, 
          solver='auto')
# scaling ElasticNet's objective by 2*n_samples turns it into Ridge's
# objective with alpha_ridge = n_samples * alpha_enet, hence alpha=1/X.shape[0];
# l1_ratio is a tiny positive value rather than an exact 0
g = ElasticNet(alpha=1/X.shape[0], l1_ratio=1e-16, 
               fit_intercept=True, normalize=False, 
               copy_X=True, max_iter=1000, tol=1e-4, random_state=42, 
               precompute=False, warm_start=False, 
               positive=False, selection='cyclic')

f.fit(X, y)
g.fit(X, y)

# element-wise relative difference between the coefficient vectors
print(abs(f.coef_ - g.coef_) / abs(f.coef_))

Output:

0.19.2
[1.19195623e-14 1.17076625e-15 3.25973465e-13 1.61694280e-14
 4.77274767e-15 4.15332538e-15 6.15640568e-14 1.61772832e-15
 4.56125088e-14 5.44320605e-14 8.99189018e-15 2.31213025e-15
 3.74181954e-15]
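The coefficients agree to roughly 13 significant digits. The key is alpha=1/X.shape[0]: as sascha's answer below shows, rescaling ElasticNet's objective by 2 * n_samples turns it into Ridge's objective with alpha multiplied by n_samples, so Ridge(alpha=1) corresponds to ElasticNet(alpha=1/n_samples). Using l1_ratio=1e-16 instead of an exact 0 keeps a numerically negligible l1 term in the solver, presumably to stay off the degenerate l1_ratio=0 path.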

Upvotes: 4

sascha

Reputation: 33532

Just read the docs. Then you will find out that none of these uses gradient descent and, more importantly, what each model actually minimizes:

Ridge (objective as given in the docs):

min_w ||X w - y||_2^2 + alpha * ||w||_2^2

Elastic Net:

min_w 1 / (2 * n_samples) * ||X w - y||_2^2
      + alpha * l1_ratio * ||w||_1
      + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2

which shows, when substituting alpha=1, l1_ratio=0, that:

  • ElasticNet scales the loss by an extra sample-dependent factor 1/(2 * n_samples) that Ridge does not have
  • ElasticNet puts an extra factor of 1/2 on the l2 term
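Putting those two differences together: with l1_ratio=0 the ElasticNet objective reduces to 1 / (2 * n_samples) * ||X w - y||_2^2 + (alpha / 2) * ||w||_2^2, and multiplying through by 2 * n_samples yields exactly the Ridge objective with alpha replaced by n_samples * alpha. A quick numerical check of that mapping (a sketch; the variable names and the test alpha are mine):

import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import ElasticNet, Ridge

X, y = load_boston(return_X_y=True)
n_samples = X.shape[0]
a = 0.01  # arbitrary ElasticNet alpha

# tiny positive l1_ratio instead of an exact 0, as in the other answer
en = ElasticNet(alpha=a, l1_ratio=1e-16).fit(X, y)
# Ridge alpha rescaled by n_samples to match ElasticNet's objective
rr = Ridge(alpha=n_samples * a).fit(X, y)

# maximum absolute coefficient difference; should be ~0 up to solver tolerance
print(np.max(np.abs(en.coef_ - rr.coef_)))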

Why different models? Probably because sklearn follows the canonical/original R implementation, glmnet.

Furthermore, I would not be surprised to see numerical issues when doing mixed-norm optimization while forcing a non-mixed norm like l1_ratio=0, especially when there are specialized solvers for both of the non-mixed optimization problems.

Luckily, sklearn even has something to say about it:

Currently, l1_ratio <= 0.01 is not reliable, unless you supply your own sequence of alpha.
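In practice, then: if you want pure l2 regularization, use Ridge directly, and reserve ElasticNet's coordinate descent for cases where l1_ratio is meaningfully greater than zero.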

Upvotes: 4
