mesllo

Reputation: 583

Graphically, how does the non-linear activation function project the input onto the classification space?

I am having a very hard time visualizing how the activation function actually manages to classify non-linearly separable training data sets.

Why does the activation function (e.g. the tanh function) work for non-linear cases? What exactly happens mathematically when the activation function projects the input to the output? What separates training samples of different classes, and how would this look if one were to plot the process graphically?

I've looked through numerous sources, but I just cannot easily grasp what exactly makes the activation function work for classifying training samples in a neural network, and I would like to be able to picture this in my mind.

Upvotes: 1

Views: 970

Answers (2)

Artem Sobolev

Reputation: 6079

The mathematical result behind neural networks is the Universal Approximation Theorem. Basically, sigmoidal functions (those which saturate on both ends, like tanh) are smooth, almost-piecewise-constant approximators. The more neurons you have, the better your approximation is.

[Figure: piece-wise linear approximation]

This picture was taken from the article A visual proof that neural nets can compute any function. Make sure to check that article; it has other examples and interactive applets.
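As a rough illustration of that idea (my own sketch, not from the article), the snippet below approximates sin(x) with a sum of steep sigmoid "steps", one per small interval; the steepness factor and the number of neurons are arbitrary choices:

```python
import numpy as np

# Steep sigmoids behave almost like step functions, which is the
# "almost-piecewise-constant" behaviour mentioned above.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(0.0, 2.0 * np.pi, 500)
target = np.sin(x)

n_neurons = 50                                   # more neurons -> better fit
edges = np.linspace(0.0, 2.0 * np.pi, n_neurons + 1)
approx = np.full_like(x, np.sin(edges[0]))
for left, right in zip(edges[:-1], edges[1:]):
    step_height = np.sin(right) - np.sin(left)   # change over this interval
    approx += step_height * sigmoid(50.0 * (x - left))  # steep "step" at left

print("max abs error:", np.max(np.abs(approx - target)))
```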

At each layer, NNs actually create new features by distorting the input space. Non-linear functions allow you to change the "curvature" of the target function, so further layers have a chance to make the data linearly separable. If there were no non-linear functions, any composition of linear functions would still be linear, so there would be no benefit from having multiple layers (a small sketch of this follows the animation below). As a graphical example, consider this animation:

[Animation: a network's layers progressively distorting the input space until the two classes become linearly separable]

These pictures were taken from this article. Also check out the cool visualization applet linked there.
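To make both points concrete, here is a minimal NumPy sketch (my own, not from the linked article): the first part checks that stacking two linear layers collapses to a single matrix, and the second uses a tanh hidden layer with hand-picked (not learned) weights to make XOR, a classic non-linearly-separable problem, separable by a single linear output unit.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Without a non-linearity, two stacked "layers" collapse into one
#    linear map, so extra layers add nothing.
W1 = rng.normal(size=(4, 2))
W2 = rng.normal(size=(1, 4))
x = rng.normal(size=(2, 5))
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# 2) With tanh between the layers, the hidden layer warps the input
#    space so that XOR becomes linearly separable for the output unit.
#    Weights below are hand-picked for illustration, not learned.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

W_h = np.array([[20.0, 20.0],
                [20.0, 20.0]])
b_h = np.array([-10.0, -30.0])
hidden = np.tanh(X @ W_h.T + b_h)        # non-linear feature space

w_o = np.array([20.0, -20.0])
b_o = -10.0
pred = (hidden @ w_o + b_o > 0).astype(int)
print(pred, "matches targets:", np.array_equal(pred, y))
```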

Upvotes: 2

Nate

Reputation: 2205

Activation functions have very little to do with classifying non-linearly separable sets of data.

Activation functions are used as a way to normalize signals at every step in your neural network. They typically have an infinite domain and a finite range. Tanh, for example, has a domain of (-∞,∞) and a range of (-1,1). The sigmoid function maps the same domain to (0,1).
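For example, a quick NumPy check of those ranges (my own illustration, not from the answer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Pre-activations with very different magnitudes all get squashed into
# a fixed, bounded range.
pre_activations = np.array([-10.0, -5.0, -0.5, 0.0, 0.5, 5.0, 10.0])
print(np.tanh(pre_activations))    # values in (-1, 1)
print(sigmoid(pre_activations))    # values in (0, 1)
```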

You can think of this as a way of enforcing a comparable scale across all of your learned features at a given neural layer (a.k.a. feature scaling). Since the input domain is not known beforehand, it is not as simple as regular feature scaling (as in linear regression), and thus activation functions must be used. The effects of the activation function are compensated for when computing errors during back-propagation.

Back-propagation is the process that applies the error to the neural network. You can think of this as a positive reward for the neurons that contributed to a correct classification and a negative reward for the neurons that contributed to an incorrect classification. Each neuron's contribution is captured by the gradient of the neural network, which is, effectively, a multi-variable derivative.
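As a toy illustration of that "reward" view (made-up numbers, not from the answer), here is one gradient-descent update on a single sigmoid neuron; each weight's adjustment is proportional to its contribution to the output error:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 2.0])     # inputs to the neuron (made up)
w = np.array([0.1, 0.4, -0.3])     # current weights (made up)
target = 1.0
lr = 0.5                           # learning rate

a = sigmoid(w @ x)                 # forward pass
error = a - target                 # signed error
grad_w = error * a * (1 - a) * x   # each weight's share of the gradient
w -= lr * grad_w                   # update: helpful weights are reinforced
print("output before / after update:", a, sigmoid(w @ x))
```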

When back-propagating the error, each individual neuron's contribution to the gradient involves the activation function's derivative at the input value for that neuron. Sigmoid is a particularly interesting function because its derivative is extremely cheap to compute: s'(x) = s(x)(1 - s(x)), so it can be evaluated directly from the value already produced in the forward pass.
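A quick numerical check of that identity (my own sketch), comparing the analytic form against a central-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-4.0, 4.0, 9)
a = sigmoid(x)                        # value from the forward pass
analytic = a * (1.0 - a)              # s'(x) = s(x) * (1 - s(x))

eps = 1e-6                            # central-difference check
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print("max difference:", np.max(np.abs(analytic - numeric)))
```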

Here is an example image (found by a Google image search for "neural network classification") that demonstrates how a neural network's decision regions might be superimposed on top of your data set: Neural Network Superimposed

I hope that gives you a relatively clear idea of how neural networks might classify non-linearly separable datasets.

Upvotes: 0
