Devjeet Roy

Reputation: 47

Neural Net with softmax output failing to converge

I've been working through Stanford's Deep Learning Tutorial and I'm having an issue with one of the exercises, the neural network with the softmax output layer. Here is my implementation in R:

train <- function(training.set, labels, costFunc, activationFunc, outputActivationFunc, activationDerivative, hidden.unit.count = 7, learningRate = 0.3, decayRate=0.02, momentumRate=0.02, samples.count, batch.size, verbose=F, debug=F){

  #initialize weights and biases
  w1 <- matrix( rnorm(hidden.unit.count * input.unit.count, sd=0.5), nrow=hidden.unit.count, ncol=input.unit.count)
  b1 <- matrix(-1, nrow=hidden.unit.count, ncol=1)
  w2 <- matrix(rnorm(output.unit.count * hidden.unit.count, sd=0.5), nrow=output.unit.count, ncol=hidden.unit.count)
  b2 <- matrix(-1, nrow=output.unit.count, ncol=1)

  cost.list<- matrix(rep(seq(1:floor(samples.count / batch.size)), each=2), byrow=T, ncol=2)
  cost.list[, 2] <- 0

  i <- 1
  while(i < samples.count){
    z2 <- w1 %*% training.set[, i: (i + batch.size - 1)] + matrix(rep(b1, each=batch.size), ncol=batch.size,byrow=T)
    a2 <- activationFunc(z2)

    z3 <- w2 %*% a2 + matrix(rep(b2, each=batch.size), ncol=batch.size,byrow=T)
    h  <- outputActivationFunc(z3)

    #calculate error
    output.error <- (h - labels[, i: (i + batch.size - 1)]) 
    hidden.error <- (t(w2) %*% output.error) * sigmoidPrime(z2)

    # calculate gradients for both layers
    gradW2 <- hidden.error %*% t(training.set[ ,i: (i + batch.size - 1)]) - momentumRate * gradW2.prev - decayRate * w1
    gradw2 <- output.error %*% t(a2) - momentumRate * gradw2.prev - decayRate * w2

    gradW2.prev <- gradW2
    gradw2.prev <- gradw2

    #update weights and biases
    w1 <- w1 - learningRate * gradW2 / batch.size
    w2 <- w2 - learningRate * gradW3 / batch.size

    b1 <- b1 - learningRate * rowSums(gradW2) / batch.size
    b2 <- b2 - learningRate * rowSums(gradW3) / batch.size

    i <- i + batch.size
  }

  return (list(w1, w2, b1, b2, cost.list))
}

Here is the softmax function I use on the output layer and also the cost function I use with softmax:

softmax <- function(a){
  a <- a - apply(a, 1, function(row){ 
      return (max(row))
  })

  a <- exp(a)

  return (sweep(a, 2, colSums(a), FUN='/'))
}

softmaxCost <- function(w, b, x, y, decayRate, batch.size){
  a <- w %*% x + matrix(rep(b, each=dim(x)[2]), byrow = T, ncol=dim(x)[2])

  h <- softmax(a)

  cost <- -1/batch.size * (sum(y * log(h))) + decayRate/2 * sum((w * w))

  return (cost)
}

I've checked the gradients computed by my program against numerical gradients and they are different. However, I can't find the source of the incorrect gradient calculation.
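
For reference, the numerical check is a central-difference comparison along these lines (a simplified sketch, not my exact code; eps and the loop over every entry of w are just the straightforward choices):

# Simplified central-difference check of softmaxCost's gradient w.r.t. w.
# Assumes w, b, x, y, decayRate and batch.size are set up with the same
# column-per-sample layout used above; eps is the usual small perturbation.
numericalGradient <- function(w, b, x, y, decayRate, batch.size, eps = 1e-4){
  grad <- matrix(0, nrow = nrow(w), ncol = ncol(w))
  for(j in seq_along(w)){
    w.plus     <- w
    w.plus[j]  <- w.plus[j] + eps
    w.minus    <- w
    w.minus[j] <- w.minus[j] - eps
    grad[j] <- (softmaxCost(w.plus, b, x, y, decayRate, batch.size) -
                softmaxCost(w.minus, b, x, y, decayRate, batch.size)) / (2 * eps)
  }
  return (grad)
}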

Also, I've successfully trained this network with sigmoid activation at the output layer on MNIST, whereas with the softmax layer it simply doesn't work (11% accuracy). This leads me to believe that the issue lies in my softmax implementation.

Upvotes: 1

Views: 922

Answers (1)

Patric

Reputation: 2131

If I understand correctly, I think the problem is in the max part of your code (ReLU). In a DNN with a softmax output, the hidden layer takes max(0, value); specifically, in this case, we apply it to each element of the matrix a.

So the code will look like:

# XW + b  (X has samples in rows here, so the bias is swept along the columns)
hidden.layer <- sweep(X %*% W, 2, b, '+', check.margin = F)
# element-wise max with 0 for each entry of the matrix (ReLU)
hidden.layer <- pmax(hidden.layer, 0)

BTW, you can use sweep to add b to the matrix instead of duplicating it across the batch with rep, which wastes a lot of memory. Three approaches are shown here.
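
For example, both of the following add a per-row bias to z <- w %*% x (samples in columns), but sweep does it without materializing the replicated matrix (toy sizes and placeholder values):

# toy illustration: add a per-row bias b to z (3 units, 5 samples in columns)
z <- matrix(rnorm(3 * 5), nrow = 3, ncol = 5)
b <- c(0.1, -0.2, 0.3)

with.rep   <- z + matrix(rep(b, each = 5), ncol = 5, byrow = T)   # builds a full 3 x 5 copy of b
with.sweep <- sweep(z, 1, b, '+')                                 # no replicated matrix needed

all.equal(with.rep, with.sweep)   # TRUE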

Be careful with the code below: y should be 0/1, with a 1 for the correct label and 0 for the others, so that sum(y * log(h)) gives the correct loss.

cost <- -1/batch.size * (sum(y * log(h))) + decayRate/2 * sum((w * w))
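
For example, if your labels come as integers 1..K, you can build that 0/1 matrix like this (a small sketch with made-up names; classes in rows, samples in columns, matching the layout above):

# sketch: turn integer class labels (1..K) into a K x N one-hot matrix,
# one column per sample with a single 1 marking the correct class
labels.int <- c(2, 1, 3, 2)   # example labels for 4 samples
K <- 3
y <- matrix(0, nrow = K, ncol = length(labels.int))
y[cbind(labels.int, seq_along(labels.int))] <- 1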

Edit: I have written a blog post about how to build a DNN with R here.

Upvotes: 2
