Use of scipy.optimize.minimize in Neural Network

Question

Trying to use Backpropagation Neural Network for multiclass classification. I have found this code and try to adapt it. It is based on the lections of Machine Learning in Coursera from Andrew Ng.

I don't understand exactly the implementation of scipy.optimize.minimize function here. It is used just once in the code. Is it iteratively updating the weights of the network? How can I visualize (plot) it's performance to see when it converges?

Using this function what parameters I can adjust to achieve better performance? I found here a list common parameters:

Number of neurons in the hidden layer: this is hidden_layer_size=25 in my code
Learning rate: can I still adjust that using built-in minimization function?
Momentum: is that reg_lambda=0 in my case? Regularization parameter to avoid overfitting, right?
Epoch: maxiter=500

Here is my training data (target class is in the last column):

65535, 3670, 65535, 3885, -0.73, 1
65535, 3962, 65535, 3556, -0.72, 1
65535, 3573, 65535, 3529, -0.61, 1
3758, 3123, 4117, 3173, -0.21, 0
3906, 3119, 4288, 3135, -0.28, 0
3750, 3073, 4080, 3212, -0.26, 0
65535, 3458, 65535, 3330, -0.85, 2
65535, 3315, 65535, 3306, -0.87, 2
65535, 3950, 65535, 3613, -0.84, 2
65535, 32576, 65535, 19613, -0.35, 3
65535, 16657, 65535, 16618, -0.37, 3
65535, 16657, 65535, 16618, -0.32, 3

The dependencies are so obvious, I think it should be so easy to classify it...

But results are terrible. I get accuracy of 0.6 to 0.8. This is absolutely inappropriate for my application. I know I need more data normally, but I would be already happy when I could at least fit the training data (without taking into account potential overfitting)

Here is the code:

import numpy as np
from scipy import optimize

from sklearn import cross_validation
from sklearn.metrics import accuracy_score
import math

class NN_1HL(object):

    def __init__(self, reg_lambda=0, epsilon_init=0.12, hidden_layer_size=25, opti_method='TNC', maxiter=500):
        self.reg_lambda = reg_lambda
        self.epsilon_init = epsilon_init
        self.hidden_layer_size = hidden_layer_size
        self.activation_func = self.sigmoid
        self.activation_func_prime = self.sigmoid_prime
        self.method = opti_method
        self.maxiter = maxiter

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def sigmoid_prime(self, z):
        sig = self.sigmoid(z)
        return sig * (1 - sig)

    def sumsqr(self, a):
        return np.sum(a ** 2)

    def rand_init(self, l_in, l_out):
        self.epsilon_init = (math.sqrt(6))/(math.sqrt(l_in + l_out))
        return np.random.rand(l_out, l_in + 1) * 2 * self.epsilon_init - self.epsilon_init

    def pack_thetas(self, t1, t2):
        return np.concatenate((t1.reshape(-1), t2.reshape(-1)))

    def unpack_thetas(self, thetas, input_layer_size, hidden_layer_size, num_labels):
        t1_start = 0
        t1_end = hidden_layer_size * (input_layer_size + 1)
        t1 = thetas[t1_start:t1_end].reshape((hidden_layer_size, input_layer_size + 1))
        t2 = thetas[t1_end:].reshape((num_labels, hidden_layer_size + 1))
        return t1, t2

    def _forward(self, X, t1, t2):
        m = X.shape[0]
        ones = None
        if len(X.shape) == 1:
            ones = np.array(1).reshape(1,)
        else:
            ones = np.ones(m).reshape(m,1)

        # Input layer
        a1 = np.hstack((ones, X))

        # Hidden Layer
        z2 = np.dot(t1, a1.T)
        a2 = self.activation_func(z2)
        a2 = np.hstack((ones, a2.T))

        # Output layer
        z3 = np.dot(t2, a2.T)
        a3 = self.activation_func(z3)
        return a1, z2, a2, z3, a3

    def function(self, thetas, input_layer_size, hidden_layer_size, num_labels, X, y, reg_lambda):
        t1, t2 = self.unpack_thetas(thetas, input_layer_size, hidden_layer_size, num_labels)

        m = X.shape[0]
        Y = np.eye(num_labels)[y]

        _, _, _, _, h = self._forward(X, t1, t2)
        costPositive = -Y * np.log(h).T
        costNegative = (1 - Y) * np.log(1 - h).T
        cost = costPositive - costNegative
        J = np.sum(cost) / m

        if reg_lambda != 0:
            t1f = t1[:, 1:]
            t2f = t2[:, 1:]
            reg = (self.reg_lambda / (2 * m)) * (self.sumsqr(t1f) + self.sumsqr(t2f))
            J = J + reg
        return J

    def function_prime(self, thetas, input_layer_size, hidden_layer_size, num_labels, X, y, reg_lambda):
        t1, t2 = self.unpack_thetas(thetas, input_layer_size, hidden_layer_size, num_labels)

        m = X.shape[0]
        t1f = t1[:, 1:]
        t2f = t2[:, 1:]
        Y = np.eye(num_labels)[y]

        Delta1, Delta2 = 0, 0
        for i, row in enumerate(X):
            a1, z2, a2, z3, a3 = self._forward(row, t1, t2)

            # Backprop
            d3 = a3 - Y[i, :].T
            d2 = np.dot(t2f.T, d3) * self.activation_func_prime(z2)

            Delta2 += np.dot(d3[np.newaxis].T, a2[np.newaxis])
            Delta1 += np.dot(d2[np.newaxis].T, a1[np.newaxis])

        Theta1_grad = (1 / m) * Delta1
        Theta2_grad = (1 / m) * Delta2

        if reg_lambda != 0:
            Theta1_grad[:, 1:] = Theta1_grad[:, 1:] + (reg_lambda / m) * t1f
            Theta2_grad[:, 1:] = Theta2_grad[:, 1:] + (reg_lambda / m) * t2f

        return self.pack_thetas(Theta1_grad, Theta2_grad)

    def fit(self, X, y):
        num_features = X.shape[0]
        input_layer_size = X.shape[1]
        num_labels = len(set(y))

        theta1_0 = self.rand_init(input_layer_size, self.hidden_layer_size)
        theta2_0 = self.rand_init(self.hidden_layer_size, num_labels)
        thetas0 = self.pack_thetas(theta1_0, theta2_0)

        options = {'maxiter': self.maxiter}
        _res = optimize.minimize(self.function, thetas0, jac=self.function_prime, method=self.method, 
                                 args=(input_layer_size, self.hidden_layer_size, num_labels, X, y, 0), options=options)

        self.t1, self.t2 = self.unpack_thetas(_res.x, input_layer_size, self.hidden_layer_size, num_labels)

        np.savetxt("weights_t1.txt", self.t1, newline="
")
        np.savetxt("weights_t2.txt", self.t2, newline="
")

    def predict(self, X):
        return self.predict_proba(X).argmax(0)

    def predict_proba(self, X):
        _, _, _, _, h = self._forward(X, self.t1, self.t2)
        return h


##################
# IR data        #
##################
values = np.loadtxt('infrared_data.txt', delimiter=', ', usecols=[0,1,2,3,4])

targets = np.loadtxt('infrared_data.txt', delimiter=', ', dtype=(int), usecols=[5])

X_train, X_test, y_train, y_test = cross_validation.train_test_split(values, targets, test_size=0.4)
nn = NN_1HL()
nn.fit(values, targets)
print("Accuracy of classification: "+str(accuracy_score(y_test, nn.predict(X_test))))

Artem Sobolev · Accepted Answer

In the given code scipy.optimize.minimize iteratively minimizes function given it's derivative (Jacobi's matrix). According to the documentation, use can specify callback argument to a function that will be called after each iteration — this will let you measure performance, though I'm not sure if it'll let you halt the optimization process.

All parameters you listed are hyperparameters, it's hard to optimize them directly:

Number of neurons in the hidden layer is a discrete valued parameters, and, thus, is not optimizable via gradient techniques. Moreover, it affects NeuralNet architecture, so you can't optimize it while training the net. What you can do, though, is to use some higher-level routine to search for possible options, like exhaustive grid search with cross-validation (for example look at GridSearchCV) or other tools for hyperparameter search (hyperopt, spearmint, MOE, etc).

Learning rate does not seem to be customizable for most of the optimization methods available. But, actually, learning rate in gradient descent is just a Newton's method with Hessian "approximated" by 1 / eta I — diagonal matrix with inverted learning rates on the major diagonal. So you can try hessian-based methods with this heuristic.

Momentum is completely unrelated to regularization. It's an optimization technique, and, since you use scipy for optimization, is unavailable for you.

Use of scipy.optimize.minimize in Neural Network

Answers (1)

Related Questions