Reputation: 47
I've been working through Stanford's Deep Learning Tutorial and I'm having an issue with one of the exercises: the neural network with a softmax output layer. Here is my implementation in R:
train <- function(training.set, labels, costFunc, activationFunc, outputActivationFunc, activationDerivative, hidden.unit.count = 7, learningRate = 0.3, decayRate=0.02, momentumRate=0.02, samples.count, batch.size, verbose=F, debug=F){
#initialize weights and biases
w1 <- matrix( rnorm(hidden.unit.count * input.unit.count, sd=0.5), nrow=hidden.unit.count, ncol=input.unit.count)
b1 <- matrix(-1, nrow=hidden.unit.count, ncol=1)
w2 <- matrix(rnorm(output.unit.count * hidden.unit.count, sd=0.5), nrow=output.unit.count, ncol=hidden.unit.count)
b2 <- matrix(-1, nrow=output.unit.count, ncol=1)
cost.list<- matrix(rep(seq(1:floor(samples.count / batch.size)), each=2), byrow=T, ncol=2)
cost.list[, 2] <- 0
i <- 1
while(i < samples.count){
z2 <- w1 %*% training.set[, i: (i + batch.size - 1)] + matrix(rep(b1, each=batch.size), ncol=batch.size,byrow=T)
a2 <- activationFunc(z2)
z3 <- w2 %*% a2 + matrix(rep(b2, each=batch.size), ncol=batch.size,byrow=T)
h <- outputActivationFunc(z3)
#calculate error
output.error <- (h - labels[, i: (i + batch.size - 1)])
hidden.error <- (t(w2) %*% output.error) * sigmoidPrime(z2)
# calculate gradients for both layers
gradW2 <- hidden.error %*% t(training.set[ ,i: (i + batch.size - 1)]) - momentumRate * gradW2.prev - decayRate * w1
gradW3 <- output.error %*% t(a2) - momentumRate * gradW3.prev - decayRate * w2
gradW2.prev <- gradW2
gradW3.prev <- gradW3
#update weights and biases
w1 <- w1 - learningRate * gradW2 / batch.size
w2 <- w2 - learningRate * gradW3 / batch.size
b1 <- b1 - learningRate * rowSums(gradW2) / batch.size
b2 <- b2 - learningRate * rowSums(gradW3) / batch.size
i <- i + batch.size
}
return (list(w1, w2, b1, b2, cost.list))
}
Here is the softmax function I use on the output layer and also the cost function I use with softmax:
softmax <- function(a){
a <- a - apply(a, 1, function(row){
return (max(row))
})
a <- exp(a)
return (sweep(a, 2, colSums(a), FUN='/'))
}
softmaxCost <- function(w, b, x, y, decayRate, batch.size){
a <- w %*% x + matrix(rep(b, each=dim(x)[2]), byrow = T, ncol=dim(x)[2])
h <- softmax(a)
cost <- -1/batch.size * (sum(y * log(h))) + decayRate/2 * sum((w * w))
return (cost)
}
I've checked the gradients computed by my program against numerical gradients and they are different. However, I can't find the source of the incorrect gradient calculation.
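For reference, this is roughly the kind of check I mean, written as a minimal standalone sketch rather than my actual training code. It assumes each column is one sample and that y is one-hot; softmax.cols, cost.fn, analytic.grad and numeric.grad are names made up for this sketch, and the data is random.

```r
# Column-wise softmax: each column of `a` is treated as one sample (assumption).
softmax.cols <- function(a) {
  a <- sweep(a, 2, apply(a, 2, max), FUN = "-")  # subtract each column's max
  a <- exp(a)
  sweep(a, 2, colSums(a), FUN = "/")
}

# Cross-entropy cost with L2 weight decay on w.
cost.fn <- function(w, b, x, y, decayRate) {
  h <- softmax.cols(sweep(w %*% x, 1, as.vector(b), FUN = "+"))
  -sum(y * log(h)) / ncol(x) + decayRate / 2 * sum(w * w)
}

# Analytic gradient of cost.fn with respect to w.
analytic.grad <- function(w, b, x, y, decayRate) {
  h <- softmax.cols(sweep(w %*% x, 1, as.vector(b), FUN = "+"))
  (h - y) %*% t(x) / ncol(x) + decayRate * w
}

# Central-difference estimate, perturbing one weight at a time.
numeric.grad <- function(w, b, x, y, decayRate, eps = 1e-5) {
  g <- matrix(0, nrow(w), ncol(w))
  for (k in seq_along(w)) {
    w.plus <- w
    w.plus[k] <- w.plus[k] + eps
    w.minus <- w
    w.minus[k] <- w.minus[k] - eps
    g[k] <- (cost.fn(w.plus, b, x, y, decayRate) -
             cost.fn(w.minus, b, x, y, decayRate)) / (2 * eps)
  }
  g
}

set.seed(1)
x <- matrix(rnorm(4 * 6), nrow = 4)            # 4 features, 6 samples
y <- diag(3)[, sample(3, 6, replace = TRUE)]   # one-hot labels, 3 classes
w <- matrix(rnorm(3 * 4, sd = 0.5), nrow = 3)
b <- matrix(-1, nrow = 3, ncol = 1)

# Largest absolute disagreement between the two gradients.
max.diff <- max(abs(analytic.grad(w, b, x, y, 0.02) -
                    numeric.grad(w, b, x, y, 0.02)))
```

When the two gradients agree, max.diff should be tiny compared with the gradient entries themselves.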
Also, I've successfully used this network with a sigmoid activation at the output layer on MNIST, whereas using the softmax layer simply doesn't work (11% accuracy). This leads me to believe that the issue lies in my softmax implementation.
Upvotes: 1
Views: 922
Reputation: 2131
If I understand correctly, I think the problem is in the max part of your code (ReLU). In a DNN with softmax, we select max(0, value). Specifically, in this case, we do this for each element of the matrix a.
So the code will look like:
# XW + b
# X %*% W has one column per hidden unit, so b is added along margin 2
hidden.layer <- sweep(X %*% W, 2, b, '+', check.margin = F)
# element-wise max for each entry of the matrix
hidden.layer <- pmax(hidden.layer, 0)
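A related point about the other max in the question, the one inside softmax: softmax only stays unchanged when the same constant is subtracted from every entry of a single sample. The question's version takes row maxima with apply(a, 1, ...) but normalises with colSums, which treats each column as a sample, so the shift is not constant within a sample and the output changes. Here is a minimal sketch of a column-wise version, assuming columns are samples; softmaxByColumn is a name I made up for illustration.

```r
# Softmax over each column; assumes one sample per column.
softmaxByColumn <- function(a) {
  col.max <- apply(a, 2, max)            # one maximum per sample
  a <- sweep(a, 2, col.max, FUN = "-")   # shift each column by its own max
  a <- exp(a)
  sweep(a, 2, colSums(a), FUN = "/")     # normalise each column to sum to 1
}

z <- matrix(c(1, 2, 3,
              10, 0, -5), nrow = 3)      # 3 classes, 2 samples
h <- softmaxByColumn(z)

# Unshifted softmax for comparison (safe here because the values are small).
reference <- exp(z) / rep(colSums(exp(z)), each = nrow(z))
```

Here h and reference should agree up to floating-point error, and each column of h sums to 1.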
BTW, you can use sweep to add b to the matrix instead of replicating it across rows, which wastes a lot of memory. Three approaches are shown here.
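As a small sketch of that idea in the question's layout (weights times inputs, one column per sample), the two ways of adding the bias below give the same result. The sizes and values are made up for illustration.

```r
set.seed(42)
W <- matrix(rnorm(3 * 4), nrow = 3)    # 3 hidden units, 4 inputs
X <- matrix(rnorm(4 * 5), nrow = 4)    # 4 inputs, 5 samples
b <- matrix(-1:1, nrow = 3, ncol = 1)  # one bias per hidden unit

n <- ncol(X)

# Replicating b into a full 3 x n matrix, as in the question.
z.replicated <- W %*% X + matrix(rep(b, each = n), ncol = n, byrow = TRUE)

# Letting sweep add b[i] to every entry of row i without building that matrix.
z.sweep <- sweep(W %*% X, 1, as.vector(b), FUN = "+")
```

Margin 1 is used here because the bias has one entry per row of W %*% X. With the X %*% W layout the bias runs along the columns instead, so the margin would be 2.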
Be careful with the code below: y should be 0/1, with 1 for the correct label and 0 for the others, so that you get the correct loss from sum(y * log(h)).
cost <- -1/batch.size * (sum(y * log(h))) + decayRate/2 * sum((w * w))
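To illustrate the 0/1 encoding, here is a minimal sketch that builds y from class indices and evaluates the cross-entropy part of the cost (the weight decay term is left out). The labels and probabilities are made-up values, and classes are in rows to match the question's layout.

```r
labels <- c(3, 1, 2, 3)   # hypothetical class index for each of 4 samples
num.classes <- 3

# One-hot matrix: classes in rows, samples in columns.
y <- matrix(0, nrow = num.classes, ncol = length(labels))
y[cbind(labels, seq_along(labels))] <- 1

# Hypothetical predicted probabilities; each column sums to 1.
h <- matrix(c(0.1, 0.2, 0.7,
              0.8, 0.1, 0.1,
              0.3, 0.4, 0.3,
              0.2, 0.2, 0.6), nrow = num.classes)

# Only the probability of the correct class contributes for each sample.
cost <- -sum(y * log(h)) / length(labels)
picked <- h[cbind(labels, seq_along(labels))]
```

With y encoded this way, sum(y * log(h)) reduces to the sum of log(picked), which is why any other encoding of y gives the wrong loss.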
Edit: I have written a blog post about how to build a DNN with R here.
Upvotes: 2