mesllo

Reputation: 583

Graphically, how does the non-linear activation function project the input onto the classification space?

I am having a very hard time visualizing how the activation function actually manages to classify non-linearly separable training data sets.

Why does the activation function (e.g. the tanh function) work for non-linear cases? What exactly happens mathematically when the activation function projects the input to the output? What separates training samples of different classes, and how would this look if one were to plot the process graphically?

I've looked through numerous sources, but I just cannot easily grasp what exactly makes the activation function work for classifying training samples in a neural network, and I would like to be able to picture this in my mind.

Upvotes: 1

Views: 970

Answers (2)

Artem Sobolev

Reputation: 6079

The mathematical result behind neural networks is the Universal Approximation Theorem. Basically, sigmoidal functions (those which saturate on both ends, like tanh) are smooth, almost-piecewise-constant approximators. The more neurons you have, the better your approximation is.

[Figure: piece-wise linear approximation]

This picture was taken from the article A visual proof that neural nets can compute any function. Make sure to check that article; it has other examples and interactive applets.
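As a rough illustration of that idea (my own sketch, not from the article), the snippet below approximates sin(x) with a sum of steep sigmoid "steps", one per small interval; the steepness factor and the number of neurons are arbitrary choices:

```python
import numpy as np

# Steep sigmoids behave almost like step functions, which is the
# "almost-piecewise-constant" behaviour mentioned above.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(0.0, 2.0 * np.pi, 500)
target = np.sin(x)

n_neurons = 50                                   # more neurons -> better fit
edges = np.linspace(0.0, 2.0 * np.pi, n_neurons + 1)
approx = np.full_like(x, np.sin(edges[0]))
for left, right in zip(edges[:-1], edges[1:]):
    step_height = np.sin(right) - np.sin(left)   # change over this interval
    approx += step_height * sigmoid(50.0 * (x - left))  # steep "step" at left

print("max abs error:", np.max(np.abs(approx - target)))
```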

At each layer, NNs actually create new features by distorting the input space. Non-linear functions allow you to change the "curvature" of the target function, so further layers have a chance to make the data linearly separable. If there were no non-linear functions, any composition of linear functions would still be linear, so there would be no benefit from having multiple layers (a small sketch of this follows the animation below). As a graphical example, consider this animation:

[Animation: a network's layers progressively distorting the input space until the two classes become linearly separable]

These pictures were taken from this article. Also check out the cool visualization applet linked there.
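To make both points concrete, here is a minimal NumPy sketch (my own, not from the linked article): the first part checks that stacking two linear layers collapses to a single matrix, and the second uses a tanh hidden layer with hand-picked (not learned) weights to make XOR, a classic non-linearly-separable problem, separable by a single linear output unit.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Without a non-linearity, two stacked "layers" collapse into one
#    linear map, so extra layers add nothing.
W1 = rng.normal(size=(4, 2))
W2 = rng.normal(size=(1, 4))
x = rng.normal(size=(2, 5))
assert np.allclose(W2 @ (W1 @ x), (W2 @ W1) @ x)

# 2) With tanh between the layers, the hidden layer warps the input
#    space so that XOR becomes linearly separable for the output unit.
#    Weights below are hand-picked for illustration, not learned.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

W_h = np.array([[20.0, 20.0],
                [20.0, 20.0]])
b_h = np.array([-10.0, -30.0])
hidden = np.tanh(X @ W_h.T + b_h)        # non-linear feature space

w_o = np.array([20.0, -20.0])
b_o = -10.0
pred = (hidden @ w_o + b_o > 0).astype(int)
print(pred, "matches targets:", np.array_equal(pred, y))
```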

Upvotes: 2

Nate

Reputation: 2205

Activation functions have very little to do with classifying non-linearly separable sets of data.

Activation functions are used as a way to normalize signals at every step in your neural network. They typically have an infinite domain and a finite range. Tanh, for example, has a domain of (-∞,∞) and a range of (-1,1). The sigmoid function maps the same domain to (0,1).
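For example, a quick NumPy check of those ranges (my own illustration, not from the answer):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Pre-activations with very different magnitudes all get squashed into
# a fixed, bounded range.
pre_activations = np.array([-10.0, -5.0, -0.5, 0.0, 0.5, 5.0, 10.0])
print(np.tanh(pre_activations))    # values in (-1, 1)
print(sigmoid(pre_activations))    # values in (0, 1)
```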

You can think of this as a way of enforcing a comparable scale across all of your learned features at a given neural layer (a.k.a. feature scaling). Since the input domain is not known beforehand, it is not as simple as regular feature scaling (as in linear regression), and thus activation functions must be used. The effects of the activation function are compensated for when computing errors during back-propagation.

Back-propagation is the process that applies the error to the neural network. You can think of this as a positive reward for the neurons that contributed to a correct classification and a negative reward for the neurons that contributed to an incorrect classification. Each neuron's contribution is captured by the gradient of the neural network, which is, effectively, a multi-variable derivative.
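As a toy illustration of that "reward" view (made-up numbers, not from the answer), here is one gradient-descent update on a single sigmoid neuron; each weight's adjustment is proportional to its contribution to the output error:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 2.0])     # inputs to the neuron (made up)
w = np.array([0.1, 0.4, -0.3])     # current weights (made up)
target = 1.0
lr = 0.5                           # learning rate

a = sigmoid(w @ x)                 # forward pass
error = a - target                 # signed error
grad_w = error * a * (1 - a) * x   # each weight's share of the gradient
w -= lr * grad_w                   # update: helpful weights are reinforced
print("output before / after update:", a, sigmoid(w @ x))
```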

When back-propagating the error, each individual neuron's contribution to the gradient involves the activation function's derivative at the input value for that neuron. Sigmoid is a particularly interesting function because its derivative is extremely cheap to compute: s'(x) = s(x)(1 - s(x)), so it can be evaluated directly from the value already produced in the forward pass.
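A quick numerical check of that identity (my own sketch), comparing the analytic form against a central-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-4.0, 4.0, 9)
a = sigmoid(x)                        # value from the forward pass
analytic = a * (1.0 - a)              # s'(x) = s(x) * (1 - s(x))

eps = 1e-6                            # central-difference check
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print("max difference:", np.max(np.abs(analytic - numeric)))
```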

Here is an example image (found by a Google image search for "neural network classification") that demonstrates how a neural network's decision regions might be superimposed on top of your data set: Neural Network Superimposed

I hope that gives you a relatively clear idea of how neural networks might classify non-linearly separable datasets.

Upvotes: 0
