Reputation: 61
I understand this decision depends on the task, but let me explain.
I'm designing a model that predicts steering angles from a given dashboard video frame using a convolutional neural network with dense layers at the end. My final dense layer has a single unit that predicts the steering angle.
My question is: for my task, would either option below give a boost in performance?
Option A: Take the ground truth steering angles, convert them to radians, and squash them with tanh so they fall between -1 and 1. In the final dense layer of the network, use a tanh activation function.
Option B: Use the raw ground truth steering angles, which fall between -420 and 420 degrees, and use a linear activation in the final layer.
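Here is roughly what I mean by the two encodings (a NumPy sketch; angles_deg stands in for my array of ground-truth angles in degrees):

    import numpy as np

    # Hypothetical ground-truth steering angles in degrees, within [-420, 420].
    angles_deg = np.array([-420.0, -35.5, 0.0, 12.3, 420.0])

    # Option A: degrees -> radians -> tanh, so targets end up in (-1, 1).
    targets_a = np.tanh(np.deg2rad(angles_deg))

    # Option B: keep the raw degree values and predict them directly.
    targets_b = angles_deg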
My reasoning is that with option A the loss will likely be much smaller, since the network is dealing with much smaller numbers, and this would lead to smaller changes in the weights.
Let me know your thoughts!
Upvotes: 1
Views: 1320
Reputation: 53768
There are two main types of variables in a neural network: weights and biases (mostly; there are additional variables, e.g. the moving mean and moving variance required for batch norm). They behave a bit differently: for instance, biases are not penalized by a regularizer, so they don't tend to get small. So the assumption that the network is dealing only with small numbers is not accurate.
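For example, in a typical Keras layer the penalty is attached to the kernel only, while the bias is left unregularized (a sketch, not taken from your model):

    from tensorflow.keras import layers, regularizers

    # The L2 penalty applies to the kernel (weights) only; bias_regularizer
    # defaults to None, so the biases feel no pressure to shrink.
    dense = layers.Dense(64, activation="relu",
                         kernel_regularizer=regularizers.l2(1e-4))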
Still, biases need to be learned, and as can be seen from the ResNet results, it's easier to learn smaller values. In this sense, I'd rather pick a [-1, 1] target range over [-420, 420]. But tanh is probably not an optimal activation function:
- With tanh (just like with sigmoid), a saturated neuron kills the gradient during backprop. Choosing tanh for no specific reason is likely to hurt your training.
- tanh needs to compute exp, which is also relatively expensive.
My option would be (at least initially, until some other variant proves to work better) to squeeze the ground truth values and have no activation at all (I think that's what you mean by a linear activation): let the network learn the [-1, 1] range by itself.
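If it helps, here is a minimal Keras sketch of that suggestion; the 420-degree limit, the frame size, and the layer sizes are placeholder assumptions rather than a tested architecture:

    import numpy as np
    from tensorflow.keras import layers, models

    MAX_ANGLE_DEG = 420.0  # assumed physical steering limit

    def scale_targets(angles_deg):
        """Squeeze raw angles in [-420, 420] degrees into [-1, 1]."""
        return np.asarray(angles_deg) / MAX_ANGLE_DEG

    model = models.Sequential([
        layers.Input(shape=(66, 200, 3)),              # placeholder frame size
        layers.Conv2D(24, 5, strides=2, activation="relu"),
        layers.Conv2D(36, 5, strides=2, activation="relu"),
        layers.Conv2D(48, 5, strides=2, activation="relu"),
        layers.Flatten(),
        layers.Dense(100, activation="relu"),
        layers.Dense(1),                               # no activation: linear output
    ])
    model.compile(optimizer="adam", loss="mse")

You would train on scale_targets(raw_degrees) and multiply predictions back by 420 at inference time to recover degrees; option B is essentially the same model trained directly on the unscaled angles.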
In general, if you have any activation functions in the hidden layers, ReLU has proven to work better than sigmoid, though other modern functions have been proposed more recently, e.g. leaky ReLU, PReLU, ELU, etc. You might try any of those.
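In Keras, for example, these alternatives are all readily available (a sketch of hidden-layer choices only, not specific to your model):

    from tensorflow.keras import layers

    # ReLU and ELU can be requested by name:
    relu_layer = layers.Dense(100, activation="relu")
    elu_layer = layers.Dense(100, activation="elu")

    # Leaky ReLU and PReLU ship as standalone layers, applied after a
    # linear Dense layer:
    leaky = [layers.Dense(100), layers.LeakyReLU()]
    prelu = [layers.Dense(100), layers.PReLU()]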
Upvotes: 2