Mohamed Moustafa

Reputation: 377

Improving accuracy of multinomial logistic regression model built from scratch

I am currently working on creating a multi-class classifier using numpy, and I finally got a working model using softmax, as follows:

import numpy as np


class MultinomialLogReg:
    def fit(self, X, y, lr=0.00001, epochs=1000):
        # Prepend a bias column of ones, then normalize
        self.X = self.norm_x(np.insert(X, 0, 1, axis=1))
        self.y = y
        self.classes = np.unique(y)
        # One weight vector per class (bias term included)
        self.theta = np.zeros((len(self.classes), self.X.shape[1]))
        self.o_h_y = self.one_hot(y)
        
        for e in range(epochs):
            preds = self.probs(self.X)

            l, grad = self.get_loss(self.theta, self.X, self.o_h_y, preds)
            
            if e%10000 == 0:
                print("epoch: ", e, "loss: ", l)
            
            # Plain gradient-descent update
            self.theta -= (lr*grad)
        
        return self
    
    def norm_x(self, X):
        # Min-max scale each row of X into [0, 1]
        for i in range(X.shape[0]):
            mn = np.amin(X[i])
            mx = np.amax(X[i])
            X[i] = (X[i] - mn)/(mx-mn)
        return X
    
    def one_hot(self, y):
        # Build an (n_samples, n_classes) one-hot encoding of the labels
        Y = np.zeros((y.shape[0], len(self.classes)))
        for i in range(Y.shape[0]):
            to_put = [0]*len(self.classes)
            to_put[y[i]] = 1
            Y[i] = to_put
        return Y
    
    def probs(self, X):
        # Class probabilities: softmax over the linear scores X @ theta.T
        return self.softmax(np.dot(X, self.theta.T))
    
    def get_loss(self, w, x, y, preds):
        m = x.shape[0]

        # Average loss over the m samples
        loss = (-1 / m) * np.sum(y * np.log(preds) + (1-y) * np.log(1-preds))

        # Gradient of the cross-entropy loss with respect to the weights
        grad = (1 / m) * (np.dot((preds - y).T, x))

        return loss, grad

    def softmax(self, z):
        # Row-wise softmax: exponentiate and normalize each row to sum to 1
        return np.exp(z) / np.sum(np.exp(z), axis=1).reshape(-1,1)
    
    def predict(self, X):
        # Add the bias column and return the index of the most probable class
        X = np.insert(X, 0, 1, axis=1)
        return np.argmax(self.probs(X), axis=1)
        #return np.vectorize(lambda i: self.classes[i])(np.argmax(self.probs(X), axis=1))
        
    def score(self, X, y):
        # Fraction of correct predictions (accuracy)
        return np.mean(self.predict(X) == y)
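
For context, here is a minimal way to exercise the class, continuing from the code above; the synthetic data and the hyperparameters below are purely illustrative:

# Illustrative usage only: synthetic 3-class data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                       # 300 samples, 4 features
W_true = rng.normal(size=(4, 3))                    # 3 latent classes
y = np.argmax(X @ W_true + 0.1 * rng.normal(size=(300, 3)), axis=1)

model = MultinomialLogReg().fit(X, y, lr=0.1, epochs=1000)
print("train accuracy:", model.score(X, y))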

I have several questions:

  1. Is this a correct multinomial logistic regression implementation?

  2. With a learning rate of 0.1, it takes about 100,000 epochs for the loss to come down to somewhere between 0.5 and 1, and the model reaches 70 - 90% accuracy on the test set. Would this be considered bad performance?

  3. What are some ways to improve performance or speed up training (so that fewer epochs are needed)?

  4. I saw this cost function online and it gives better accuracy. It looks like cross-entropy, but it is different from the cross-entropy equations I have seen for this optimization; can someone explain how the two differ:

error = preds - self.o_h_y
grad = np.dot(error.T, self.X)
self.theta -= (lr*grad)

Upvotes: 1

Views: 963

Answers (1)

Melih Elibol

Reputation: 86

  1. This looks right, but I think the preprocessing you perform in the fit function should be done outside of the model.
  2. It's hard to say whether this is good or bad. While the loss landscape is convex, the time it takes to reach the minimum varies from problem to problem. One way to make sure you've found the optimal solution is to add a threshold on the size of the gradient norm, which becomes small when you're close to the optimum; something like np.linalg.norm(grad) < 1e-8 (see the sketch after this list).
  3. You can use a better optimizer, such as Newton's method, or a quasi-Newton method such as LBFGS. I would start with Newton's method, as it's easier to implement; LBFGS is a non-trivial algorithm that approximates the Hessian required by Newton's method. A rough Newton step is also sketched after this list.
  4. It's the same gradient; it just isn't averaged over the samples. Since you're performing gradient descent, the 1/m factor is a constant that can be folded into the learning rate, which needs to be tuned anyway. In general, I think averaging makes it a bit easier to find a stable learning rate across different splits of the same dataset.
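
To illustrate point 2, here is a minimal sketch of such a stopping test. The loss_and_grad argument is a placeholder for any callable that returns (loss, grad) for the current weights, like your get_loss; none of these names come from your code.

import numpy as np

# Sketch only: `loss_and_grad` stands in for any callable returning (loss, grad).
def gradient_descent(theta, loss_and_grad, lr=0.1, max_epochs=100000, tol=1e-8):
    for e in range(max_epochs):
        loss, grad = loss_and_grad(theta)
        if np.linalg.norm(grad) < tol:   # gradient ~ 0: essentially at the optimum
            print("converged at epoch", e, "with loss", loss)
            break
        theta = theta - lr * grad        # standard gradient-descent step
    return theta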
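
And a rough sketch of point 3: a single damped Newton step for softmax regression. It assumes the same shapes as your code (X already carries the bias column, Y is the one-hot labels, theta is (n_classes, n_features)); the ridge term is my addition, since the softmax parameterization makes the raw Hessian singular.

import numpy as np

def newton_step(theta, X, Y, ridge=1e-4):
    # Sketch, not your exact code. theta: (K, d); X: (m, d) with bias column; Y: (m, K) one-hot.
    m, d = X.shape
    K = theta.shape[0]

    scores = X @ theta.T
    scores -= scores.max(axis=1, keepdims=True)      # for numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)                # (m, K) predicted probabilities

    g = ((P - Y).T @ X / m).ravel()                  # flattened gradient, shape (K*d,)

    # Hessian of the averaged cross-entropy, built block by block: (K*d, K*d)
    H = np.zeros((K * d, K * d))
    for c in range(K):
        for cp in range(K):
            w = P[:, c] * ((c == cp) - P[:, cp])     # per-sample curvature weights
            H[c*d:(c+1)*d, cp*d:(cp+1)*d] = (X * w[:, None]).T @ X / m
    H += ridge * np.eye(K * d)                       # damping keeps the system solvable

    step = np.linalg.solve(H, g).reshape(K, d)
    return theta - step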

A question for you: when you evaluate on your test set, are you preprocessing it the same way you preprocess the training set in your fit function?
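
For example, one common pattern is to fit the normalization statistics on the training data only and reuse them at prediction time. The sketch below uses feature-wise (column-wise) min-max scaling, which is just one possible choice and not necessarily what you intended with your per-row norm_x; all names here are illustrative.

import numpy as np

class MinMaxScaler:
    # Minimal hand-rolled scaler (not sklearn's): remembers per-feature min/max from fit.
    def fit(self, X):
        self.mn = X.min(axis=0)
        self.mx = X.max(axis=0)
        return self

    def transform(self, X):
        return (X - self.mn) / (self.mx - self.mn)

# Usage (X_train / X_test etc. are placeholders):
#   scaler = MinMaxScaler().fit(X_train)
#   model.fit(scaler.transform(X_train), y_train)
#   model.score(scaler.transform(X_test), y_test)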

Upvotes: 1
