Jeffrey

Reputation: 41

Artificial Neural Network: why is the sigmoid activation function usually used in the hidden layer instead of the tanh-sigmoid activation function?

Why is the log-sigmoid activation function the primary choice for the hidden layer instead of the tanh-sigmoid activation function? Also, if I use Z-score normalization, can I still use the sigmoid activation function in the hidden layer?

Upvotes: 1

Views: 866

Answers (1)

Jonas Adler

Reputation: 10759

Ancient history

Historically, the choice of the sigmoid function was physically motivated. The very first neural networks, in the very early days, in fact used the step function

[Plot of the step activation function: output 0 below the threshold, 1 above it]

The motivation was that this is how neurons in the brain work, at least according to the understanding at the time: once a fixed activation threshold is exceeded, the neuron "activates", going from inactive (0) to active (1). However, these networks are very hard to train, and the standard training paradigm was also physically motivated, e.g. "neurons that are used often get a stronger connection". This worked for very small networks, but did not scale at all to larger ones.

Gradient descent and the advent of the sigmoid

In the 80s there was a small revolution in neural networks when it was discovered that they can be trained using gradient descent. This allowed networks to scale to much larger sizes, but it also spelled the end of the step activation, since it is not differentiable (its derivative is zero everywhere except at the jump). However, given the long history of the step activation and its plausible physical motivation, people were hesitant to abandon it entirely and instead approximated it with the sigmoid function, which shares many of its characteristics but is smooth and differentiable everywhere.
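As a minimal sketch (my own illustration, not part of the original answer), the difference in differentiability is easy to see numerically: the step function has zero gradient almost everywhere, while the sigmoid provides a usable gradient everywhere, which is what gradient descent needs:

```python
import numpy as np

def step(x):
    # Heaviside step: 0 below the threshold (here 0), 1 above it
    return (x > 0).astype(float)

def sigmoid(x):
    # Smooth approximation of the step, differentiable everywhere
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5, 5, 11)
print(step(x))          # jumps from 0 to 1 at x = 0, gradient 0 everywhere else
print(sigmoid(x))       # smooth transition from 0 to 1
print(sigmoid_grad(x))  # nonzero gradient, usable for gradient descent
```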

Later on, people started using the tanh function instead, since it is zero-centered, which gives somewhat better training behaviour in some cases.
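For reference, tanh is just a rescaled and shifted sigmoid, which is why it has the same S-shape but outputs values in (-1, 1) centered around zero instead of (0, 1):

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1, \qquad \sigma(x) = \frac{1}{1 + e^{-x}}$$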

The RevoLUtion

Then, in 2000, a seminal paper was published in Nature that suggested the use of the ReLU activation function:

[Plot of the ReLU activation function: 0 for negative inputs, the identity for positive inputs]

This was motivated by problems with the earlier activation functions, the most important being speed and the fact that the ReLU does not suffer from the vanishing gradient problem. Since then, basically all top neural network research has used the ReLU activation or slight variations thereof.
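A rough sketch (my own, not from the paper) of why the sigmoid is prone to vanishing gradients while the ReLU is not: the sigmoid's derivative is at most 0.25, so multiplying many of them during backpropagation shrinks the gradient geometrically, whereas the ReLU's derivative is exactly 1 for positive inputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # at most 0.25

def relu_grad(x):
    return (x > 0).astype(float)  # exactly 1 for positive inputs

# Product of 20 activation derivatives, all evaluated at the same
# pre-activation for simplicity: a crude stand-in for backpropagating
# through 20 stacked layers.
x = 0.5
grad_sigmoid = np.prod([sigmoid_grad(x) for _ in range(20)])
grad_relu = np.prod([relu_grad(x) for _ in range(20)])
print(grad_sigmoid)  # ~1e-13, the gradient has effectively vanished
print(grad_relu)     # 1.0, the gradient passes through unchanged
```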

The only exception is perhaps recurrent networks, where the output is fed back as input. In these, an unbounded activation function such as the ReLU would quickly lead to the activations exploding, so people still use the sigmoid and/or tanh in these cases.
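A small sketch (my own, under the simplifying assumption of a scalar recurrence with a recurrent weight slightly above 1) of why bounded activations matter when the output is fed back as input:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

w = 1.5   # recurrent weight, assumed > 1 for illustration
h_relu, h_tanh = 1.0, 1.0

# Feed the hidden state back through the same activation 50 times.
for _ in range(50):
    h_relu = relu(w * h_relu)     # unbounded: grows like w**t
    h_tanh = np.tanh(w * h_tanh)  # bounded: stays in (-1, 1)

print(h_relu)  # ~6e8, the state has exploded
print(h_tanh)  # ~0.86, the state stays bounded
```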

Upvotes: 3
