user16386186

Why can't I get the result I got with the sklearn LogisticRegression with the coefficients_sgd method?

I used the code below, from "How To Implement Logistic Regression From Scratch in Python":

from math import exp
import numpy as np
from sklearn.linear_model import LogisticRegression

def predict(row, coefficients):
    # coefficients[0] is the intercept; the last element of row is the label
    yhat = coefficients[0]
    for i in range(len(row)-1):
        yhat += coefficients[i + 1] * row[i]
    return 1.0 / (1.0 + exp(-yhat))

def coefficients_sgd(train, l_rate, n_epoch):
    # Start from all-zero coefficients (intercept plus one per feature)
    coef = [0.0 for i in range(len(train[0]))]
    for epoch in range(n_epoch):
        sum_error = 0
        for row in train:
            yhat = predict(row, coef)
            error = row[-1] - yhat
            sum_error += error**2
            # Update the intercept, then each feature coefficient, one row at a time
            coef[0] = coef[0] + l_rate * error * yhat * (1.0 - yhat)
            for i in range(len(row)-1):
                coef[i + 1] = coef[i + 1] + l_rate * error * yhat * (1.0 - yhat) * row[i]
    return coef

dataset = [[2.7810836,2.550537003,0],
[1.465489372,2.362125076,0],
[3.396561688,4.400293529,0],
[1.38807019,1.850220317,0],
[3.06407232,3.005305973,0],
[7.627531214,2.759262235,1],
[5.332441248,2.088626775,1],
[6.922596716,1.77106367,1],
[8.675418651,-0.242068655,1],
[7.673756466,3.508563011,1]]

l_rate = 0.3
n_epoch = 100
coef = coefficients_sgd(dataset, l_rate, n_epoch)
print(coef)

[-0.39233141593823756, 1.4791536027917747, -2.316697087065274]

x = np.array(dataset)[:,:2]
y = np.array(dataset)[:,2]
model = LogisticRegression(penalty="none")
model.fit(x,y)
print(model.intercept_.tolist() + model.coef_.ravel().tolist())

[-3.233238244349982, 6.374828107647225, -9.631487530388092]

What should I change to get the same or closer coefficients? How can I establish the initial coefficients, learning rate, and n_epoch?

Upvotes: 6

Views: 271

Answers (1)

Sanjar Adilov

Reputation: 1099

Well, there are many nuances here 🙂

First, recall that the coefficients of logistic regression are estimated by minimizing the negative log-likelihood. Various optimization methods can do this, including the SGD you implemented, but there is no exact, closed-form solution. So even an exact copy of scikit-learn's LogisticRegression would need the same hyperparameters (number of epochs, learning rate, etc.) and the same random state to produce the same coefficients.

Second, LogisticRegression offers five different optimization methods (the solver parameter). You call LogisticRegression(penalty="none") with its default parameters, and the default solver is 'lbfgs', not SGD. So, depending on your data and hyperparameters, you may get significantly different results.

What should I change to get the same or closer coefficients?

I would suggest comparing your implementation with SGDClassifier(loss='log') first, since LogisticRegression does not offer an SGD solver (in scikit-learn 1.1 and later, this loss is named 'log_loss'). Keep in mind, though, that scikit-learn's implementation is more sophisticated; in particular, it has more hyperparameters, such as tol for early stopping.

How can I establish initial coefficients, learning rate, n_epoch?

Typically, coefficients for SGD are initialized randomly (e.g., uniform(-1/(2n), 1/(2n))), from some data statistics (e.g., dot(y, w)/dot(w, w) for every coefficient w), or with a pre-trained model's parameters. By contrast, there is no golden rule for the learning rate or the number of epochs. Usually, we set a large number of epochs together with some other stopping criterion (e.g., whether the norm of the difference between the current and previous coefficients is smaller than some small tol) and a moderate learning rate. At every iteration, we then reduce the learning rate following some rule (see the learning_rate parameter of SGDClassifier or the User Guide) and check the stopping criterion.
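The recipe above (many epochs, a decaying learning rate, and a tol-based stopping check) might look roughly like this in plain Python. This is only a sketch: the function name sgd_log_loss, the decay parameter, and the schedule eta0 / (1 + decay * epoch) are hypothetical choices of mine, not scikit-learn's rule. Note also that the update in the question multiplies the error by yhat * (1 - yhat), which is the gradient of a squared-error loss; the sketch uses the log-loss gradient instead, since that is the loss logistic regression minimizes.

```python
from math import exp, sqrt

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)

def predict(row, coef):
    # coef[0] is the intercept; the last element of row is the label.
    z = coef[0] + sum(c * x for c, x in zip(coef[1:], row[:-1]))
    return sigmoid(z)

def sgd_log_loss(train, eta0=0.3, decay=0.01, max_epoch=1000, tol=1e-6):
    coef = [0.0] * len(train[0])
    for epoch in range(max_epoch):
        # Hypothetical schedule: shrink the step size a little every epoch.
        l_rate = eta0 / (1.0 + decay * epoch)
        previous = coef[:]
        for row in train:
            # Log-loss gradient: no yhat * (1 - yhat) factor.
            error = row[-1] - predict(row, coef)
            coef[0] += l_rate * error
            for i, x in enumerate(row[:-1]):
                coef[i + 1] += l_rate * error * x
        # Stop when the coefficients barely moved during this epoch.
        change = sqrt(sum((c - p) ** 2 for c, p in zip(coef, previous)))
        if change < tol:
            break
    return coef, epoch + 1

# Dataset from the question.
dataset = [[2.7810836, 2.550537003, 0],
           [1.465489372, 2.362125076, 0],
           [3.396561688, 4.400293529, 0],
           [1.38807019, 1.850220317, 0],
           [3.06407232, 3.005305973, 0],
           [7.627531214, 2.759262235, 1],
           [5.332441248, 2.088626775, 1],
           [6.922596716, 1.77106367, 1],
           [8.675418651, -0.242068655, 1],
           [7.673756466, 3.508563011, 1]]

coef, epochs_run = sgd_log_loss(dataset)
print(coef, epochs_run)
```

Because this dataset is linearly separable, the unpenalized coefficients can keep growing, so the tol check may never trigger and the loop may simply stop at max_epoch.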

Upvotes: 7
