Reputation: 1932
I'm interested in building the derivative of softmax in TensorFlow, and as a new user I'm stuck.
The closest code I can find is a NumPy version, from "Softmax derivative in NumPy approaches 0 (implementation)"; the code is below. I can translate the softmax portion into TensorFlow easily, but I'm stuck on how to apply the derivative section in TensorFlow: the three lines under "if derivative" are giving me trouble. How would you go about building those three lines of the derivative portion?
Thank you.
if derivative:
    J = - signal[..., None] * signal[:, None, :]    # off-diagonal Jacobian
    iy, ix = np.diag_indices_from(J[0])
    J[:, iy, ix] = signal * (1. - signal)           # diagonal
    return J.sum(axis=1)
Here is the full code from the link above.
import numpy as np

def softmax_function(signal, derivative=False):
    # Calculate activation signal
    e_x = np.exp(signal)
    signal = e_x / np.sum(e_x, axis=1, keepdims=True)
    if derivative:
        J = - signal[..., None] * signal[:, None, :]    # off-diagonal Jacobian
        iy, ix = np.diag_indices_from(J[0])
        J[:, iy, ix] = signal * (1. - signal)           # diagonal
        return J.sum(axis=1)
    else:
        # Return the activation signal
        return signal
Upvotes: 1
Views: 2655
Reputation: 3633
You can write your derivative part directly in tensorflow as follows:
# Assuming signal is a tensor already created in TensorFlow.
if derivative:
    J = - signal[..., None] * signal[:, None, :]          # off-diagonal Jacobian
    J = tf.linalg.set_diag(J, signal * (1. - signal))     # diagonal
    return tf.reduce_sum(J, 1)
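For completeness, here is a minimal sketch of the whole function in TensorFlow (the name softmax_tf and the use of tf.nn.softmax are my own choices, not from the question):
import tensorflow as tf

def softmax_tf(signal, derivative=False):
    # Forward pass: softmax over the last axis of a (batch, features) tensor.
    signal = tf.nn.softmax(signal)
    if not derivative:
        return signal
    J = - signal[..., None] * signal[:, None, :]          # off-diagonal Jacobian
    J = tf.linalg.set_diag(J, signal * (1. - signal))     # diagonal
    return tf.reduce_sum(J, 1)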
Upvotes: 0
Reputation: 53768
I have several notes on your code:
TensorFlow is based on a computational graph, so there's no need to compute gradients manually. It's totally fine to do it in pure NumPy as an exercise, but it's worth knowing.
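For example, a minimal sketch (TF 1.x graph style, to match the sessions and placeholders used elsewhere in this thread; shapes and names are illustrative) of letting TensorFlow build the gradient for you:
import tensorflow as tf

# Placeholders for the pre-softmax scores and the one-hot labels.
signal = tf.placeholder(tf.float32, [None, 10])
labels = tf.placeholder(tf.float32, [None, 10])

probs = tf.nn.softmax(signal)
loss = tf.reduce_mean(-tf.reduce_sum(labels * tf.log(probs), axis=1))

# TensorFlow differentiates through the graph; no hand-written Jacobian is needed.
grad_wrt_signal = tf.gradients(loss, signal)[0]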
Your softmax forward calculation is correct, but potentially numerically unstable. In short, exponentiating very large values can overflow and lose precision, so it's better to subtract the maximum value of signal before computing the exponent:
stable_signal = signal - np.max(signal)
e_x = np.exp(stable_signal)
signal = e_x / np.sum(e_x, axis=1, keepdims=True)
See the section "Practical issues: Numeric stability" in this post for more details.
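A per-row variant of the same trick (my own addition, assuming signal is a 2-D batch of examples) shifts each row by its own maximum:
stable_signal = signal - np.max(signal, axis=1, keepdims=True)    # shift each row by its own max
e_x = np.exp(stable_signal)
signal = e_x / np.sum(e_x, axis=1, keepdims=True)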
The softmax derivative itself is a bit hairy. It is more efficient (and easier) to compute the backward signal from the softmax layer, that is, the derivative of the cross-entropy loss with respect to the signal. To do it, you need to pass the correct labels y into softmax_function as well. Then the computation is the following:
num_train = signal.shape[0]
sums_per_row = np.sum(e_x, axis=1, keepdims=True)    # can also be reused from the forward pass
all_softmax_matrix = e_x / sums_per_row
grad_coeff = np.zeros_like(signal)
grad_coeff[np.arange(num_train), y] = -1.0
grad_coeff += all_softmax_matrix
return grad_coeff
If you are not sure why this is so easy, take a look at these brilliant notes.
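If it helps, here's a small numeric sanity check (my own sketch: a single example with a one-hot label y) showing that chaining dL/dp through the full softmax Jacobian gives the same result as the p - y shortcut:
import numpy as np

rng = np.random.default_rng(0)
signal = rng.normal(size=5)
y = np.zeros(5)
y[2] = 1.0                               # one-hot label

e_x = np.exp(signal - signal.max())
p = e_x / e_x.sum()                      # softmax probabilities

J = np.diag(p) - np.outer(p, p)          # full softmax Jacobian dp_i/dsignal_j
dL_dp = -y / p                           # cross-entropy derivative w.r.t. p
print(np.allclose(J @ dL_dp, p - y))     # True: chain rule matches the shortcut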
Upvotes: 2
Reputation: 98
I was recently in this exact situation and could not find much help, especially on extracting the diagonal elements to handle the i = j case.
One probable reason could be that TensorFlow does not yet have NumPy-like versatility.
See if the two alternatives below work for you.
For the first alternative, define a placeholder in your graph:
derivative = tf.placeholder(tf.float32, [None, num_features])
Assuming your softmax probabilities are stored in a variable named 'output', you can then do this in your session:
session.run(train, feed_dict={derivative: softmax(output.eval(), deriv=True)})
Note: This method could be computationally expensive.
The second alternative avoids the Jacobian entirely. The cross-entropy loss is
L = -1/N ∑_k y_k log(p_k)
As per the chain rule,
∂L/∂W_2 = ∂L/∂p_i · ∂p_i/∂a_2 · ∂a_2/∂W_2
where L is the loss function, p is the probability output, a_2 is the pre-softmax activation of the output layer, and W_2 is its weight matrix.
Calculating the first two terms of the chain rule for the two cases k = i and k != i simplifies them to
∂L/∂p_i · ∂p_i/∂a_2 = p_i - y_i    --> the delta value for the output layer.
So you can calculate your l2_delta simply by subtracting your target from the softmax output of the forward propagation, instead of using the more complex Jacobian-matrix method.
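Here is a minimal NumPy sketch of this shortcut (toy shapes; the names a2, y_one_hot, and l2_delta are illustrative):
import numpy as np

rng = np.random.default_rng(0)
N, num_classes = 4, 3
a2 = rng.normal(size=(N, num_classes))            # pre-softmax activations of the output layer
y = rng.integers(0, num_classes, size=N)          # integer class labels

e_x = np.exp(a2 - a2.max(axis=1, keepdims=True))
p = e_x / e_x.sum(axis=1, keepdims=True)          # softmax output from the forward pass

y_one_hot = np.zeros_like(p)
y_one_hot[np.arange(N), y] = 1.0

l2_delta = p - y_one_hot                          # delta for the output layer, no Jacobian needed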
Check out these links to learn more about the math behind this:
Excellent explanation of why the softmax derivative is not as simple as that of other activations --> Derivative of a softmax function
PDF explaining the simplification steps described in this answer: PDF
Upvotes: 2