Reputation: 4285
I'm trying to implement the softmax function for a neural network written in Numpy. Let h be the softmax value of a given signal i.
I've struggled to implement the softmax activation function's partial derivative.
I'm currently stuck at issue where all the partial derivatives approaches 0 as the training progresses. I've cross-referenced my math with this excellent answer, but my math does not seem to work out.
import numpy as np
def softmax_function( signal, derivative=False ):
# Calculate activation signal
e_x = np.exp( signal )
signal = e_x / np.sum( e_x, axis = 1, keepdims = True )
if derivative:
# Return the partial derivation of the activation function
return np.multiply( signal, 1 - signal ) + sum(
# handle the off-diagonal values
- signal * np.roll( signal, i, axis = 1 )
for i in xrange(1, signal.shape[1] )
)
else:
# Return the activation signal
return signal
#end activation function
The signal
parameter contains the input signal sent into the activation function and has the shape (n_samples, n_features).
# sample signal (3 samples, 3 features)
signal = [[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]]
The following code snipped is a fully working activation function and is only included as a reference and proof (mostly for myself) that the conceptual idea actually work.
from scipy.special import expit
import numpy as np
def sigmoid_function( signal, derivative=False ):
# Prevent overflow.
signal = np.clip( signal, -500, 500 )
# Calculate activation signal
signal = expit( signal )
if derivative:
# Return the partial derivation of the activation function
return np.multiply(signal, 1 - signal)
else:
# Return the activation signal
return signal
#end activation function
Upvotes: 8
Views: 8346
Reputation: 51
@Imanol Luengo's solution is wrong the moment he takes sums across rows.
@Harveyslash also makes a good point since he noticed extremely low "gradients" in his solution, the NN won't learn or learn in the wrong direction.
We have 4 samples x 3 inputs
The thing is that softmax is not a scalar function that takes just 1 input, but 3 in this case. Remember that all inputs in one sample must add up to one, and therefore you can't compute the value of one without knowing the others? This implies that the gradient must be a square matrix, because we need to also take into account our partial derivatives.
TLDR: The output of the gradient of softmax(sample) in this case must be a 3x3 matrix.
This is correct:
J = - signal[..., None] * signal[:, None, :] # off-diagonal Jacobian
iy, ix = np.diag_indices_from(J[0])
J[:, iy, ix] = signal * (1. - signal) # diagonal
Up until this point Imanol uses fast vector operations to compute the Jacobian of the softmax function in 4 points, resulting in a 3x3 matrix stacked 4 times: 4 x 3 x 3
.
Now I think what the OP really wants is dJdZ, the first step in ANN backprop:
dJdZ(4x3) = dJdy(4x3) * gradSoftmax[layer signal(4x3)](?,?)
The problem is that we usually (with sigmoid, ReLU,... any scalar activation function) can compute the gradient as a stacked vector and then multiply element-wise with dJdy, but here we got a stacked matrix. How can we marry the two concepts?
The vector can be seen as the non zero elements of a diagonal matrix -> All this time we have been getting away with easy element-wise multiplication just because our activation function was scalar! For our softmax it's not that simple, and therefore we have to use matrix multiplication
dJdZ(4x3) = dJdy(4-1x3) * anygradient[layer signal(4,3)](4-3x3)
Now we multiply each 1x3 dJdy
vector times the 3x3 gradient
, for each of the 4
samples, but usually common operations will fail. We need to specify along which dimensions we multiply too (use np.einsum
). End result:
#For reference 'mnr,mrr->mr' = 4x1x3,4x3x3->4x3
dJdZ = np.einsum('mnr,mrr->mr', dJdy[:,None,:], gradient)
Upvotes: 1
Reputation: 15909
This is an answer on how to calculate the derivative of the softmax function in a more vectorized numpy fashion. However, the fact that the partial derivatives approach to zero might not be a math issue, and just be a problem of the learning rate or the known dying weight issue with complex deep neural networks. Layers like ReLU help preventing the latter issue.
First, I've used the following signal (just duplicating your last entry) to make it 4 samples x 3 features
so is easier to see what is going on with the dimensions.
>>> signal = [[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]]
>>> signal.shape
(4, 3)
Next, you want to compute the Jacobian matrix of your softmax function. According to the cited page it is defined as -hi * hj
for the off-diagonal entries (majority of the matrix for n_features > 2
), so lets start there. In numpy, you can efficiently calculate that Jacobian matrix using broadcasting:
>>> J = - signal[..., None] * signal[:, None, :]
>>> J.shape
(4, 3, 3)
The first signal[..., None]
(equivalent to signal[:, :, None]
) reshapes the signal to (4, 3, 1)
while the second signal[:, None, :]
reshapes the signal to (4, 1, 3)
. Then, the *
just multiplies both matrices element-wise. Numpy's internal broadcasting repeats both matrices to form the n_features x n_features
matrix for every sample.
Then, we need to fix the diagonal elements:
>>> iy, ix = np.diag_indices_from(J[0])
>>> J[:, iy, ix] = signal * (1. - signal)
The above lines extract diagonal indices for n_features x n_features
matrix. It is equivalent of doing iy = np.arange(n_features); ix = np.arange(n_features)
. Then, replaces the diagonal entries with your defitinion hi * (1 - hi)
.
Last, according to the linked source, you need to sum across rows for each of the samples. That can be done as:
>>> J = J.sum(axis=1)
>>> J.shape
(4, 3)
Find bellow a summarized version:
if derivative:
J = - signal[..., None] * signal[:, None, :] # off-diagonal Jacobian
iy, ix = np.diag_indices_from(J[0])
J[:, iy, ix] = signal * (1. - signal) # diagonal
return J.sum(axis=1) # sum across-rows for each sample
Comparison of the derivatives:
>>> signal = [[0.3394572666491664, 0.3089068053925853, 0.3516359279582483], [0.33932706934615525, 0.3094755563319447, 0.3511973743219001], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256], [0.3394407172182317, 0.30889042266755573, 0.35166886011421256]]
>>> e_x = np.exp( signal )
>>> signal = e_x / np.sum( e_x, axis = 1, keepdims = True )
Yours:
>>> np.multiply( signal, 1 - signal ) + sum(
# handle the off-diagonal values
- signal * np.roll( signal, i, axis = 1 )
for i in xrange(1, signal.shape[1] )
)
array([[ 2.77555756e-17, -2.77555756e-17, 0.00000000e+00],
[ -2.77555756e-17, -2.77555756e-17, -2.77555756e-17],
[ 2.77555756e-17, 0.00000000e+00, 2.77555756e-17],
[ 2.77555756e-17, 0.00000000e+00, 2.77555756e-17]])
Mine:
>>> J = signal[..., None] * signal[:, None, :]
>>> iy, ix = np.diag_indices_from(J[0])
>>> J[:, iy, ix] = signal * (1. - signal)
>>> J.sum(axis=1)
array([[ 4.16333634e-17, -1.38777878e-17, 0.00000000e+00],
[ -2.77555756e-17, -2.77555756e-17, -2.77555756e-17],
[ 2.77555756e-17, 1.38777878e-17, 2.77555756e-17],
[ 2.77555756e-17, 1.38777878e-17, 2.77555756e-17]])
Upvotes: 14