dward4

Reputation: 1932

Building the derivative of Softmax in Tensorflow from a NumPy version

I'm interested in building the derivative of Softmax in Tensorflow, and as a new user I'm stuck.

The closest code I can find is a NumPy version, from Softmax derivative in NumPy approaches 0 (implementation); the code is below. I am able to translate the softmax portion into tensorflow easily, but I'm stuck on how to apply the derivative section to tensorflow - the three lines under "if derivative" are giving me trouble. How would you go about building those three lines of the derivative portion?

Thank you.

Derivative Portion

if derivative:
    J = - signal[..., None] * signal[:, None, :] # off-diagonal Jacobian
    iy, ix = np.diag_indices_from(J[0])
    J[:, iy, ix] = signal * (1. - signal) # diagonal
    return J.sum(axis=1)

Here is the full code from the link above.

import numpy as np

def softmax_function( signal, derivative=False ):
    # Calculate activation signal
    e_x = np.exp( signal )
    signal = e_x / np.sum( e_x, axis = 1, keepdims = True )
    if derivative:
        J = - signal[..., None] * signal[:, None, :] # off-diagonal Jacobian
        iy, ix = np.diag_indices_from(J[0])
        J[:, iy, ix] = signal * (1. - signal) # diagonal
        return J.sum(axis=1)
    else:
        # Return the activation signal
        return signal

Upvotes: 1

Views: 2655

Answers (3)

MSS

Reputation: 3633

You can write your derivative part directly in tensorflow as follows:

# Assuming signal is a tensor created in TensorFlow.
if derivative:
    J = - signal[..., None] * signal[:, None, :]       # off-diagonal Jacobian
    J = tf.linalg.set_diag(J, signal * (1. - signal))  # diagonal
    return tf.reduce_sum(J, 1)
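
For completeness, here is how this snippet might be wrapped into a full TensorFlow counterpart of the original function. This is only a sketch; the function name tf_softmax_function and the max-subtraction for numerical stability are my own additions, not part of the question's code:

import tensorflow as tf

def tf_softmax_function(signal, derivative=False):
    # signal is assumed to be a 2-D float tensor of shape [batch, classes]
    e_x = tf.exp(signal - tf.reduce_max(signal, axis=1, keepdims=True))  # stabilised exponent
    probs = e_x / tf.reduce_sum(e_x, axis=1, keepdims=True)
    if derivative:
        J = -probs[..., None] * probs[:, None, :]        # off-diagonal Jacobian: -p_i * p_j
        J = tf.linalg.set_diag(J, probs * (1. - probs))  # diagonal: p_i * (1 - p_i)
        return tf.reduce_sum(J, axis=1)
    return probs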

Upvotes: 0

Maxim

Reputation: 53768

I have several notes on your code:

  • Tensorflow is based on a computational graph, so there's no need to compute gradients manually. It's totally fine to do it in pure numpy as an exercise, but just so you know (a sketch of the automatic approach follows after these notes).

  • Your softmax forward calculation is correct, but possibly numerically unstable. In short, exponentiating very large values can overflow and lose precision, so it's better to subtract the row-wise maximum of signal before computing the exponent:

    stable_signal = signal - np.max(signal, axis=1, keepdims=True)
    e_x = np.exp(stable_signal)
    signal = e_x / np.sum(e_x, axis=1, keepdims=True)
    

    See the section "Practical issues: Numeric stability" in this post for more details.

  • Softmax derivative itself is a bit hairy. It is more efficient (and easier) to compute the backward signal from the softmax layer, that is, the derivative of the cross-entropy loss with respect to the signal. To do that, you need to pass the correct labels y into softmax_function as well. Then the computation is the following:

    num_train = signal.shape[0]
    sums_per_row = np.sum(e_x, axis=1, keepdims=True)  # can be reused from the forward pass
    all_softmax_matrix = e_x / sums_per_row
    grad_coeff = np.zeros_like(signal)
    grad_coeff[np.arange(num_train), y] = -1
    grad_coeff += all_softmax_matrix
    return grad_coeff
    

    If you are not sure why this is so easy, take a look at these brilliant notes.
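
To illustrate the first note, here is a minimal sketch of letting TensorFlow differentiate the softmax for you. It assumes TensorFlow 2.x eager mode with tf.GradientTape; the example values are only for illustration:

import tensorflow as tf

# Any [batch, classes] float tensor will do; these values are only for illustration.
signal = tf.constant([[1.0, 2.0, 3.0],
                      [0.5, 0.5, 0.5]])

with tf.GradientTape() as tape:
    tape.watch(signal)                     # signal is a constant, so watch it explicitly
    probs = tf.nn.softmax(signal, axis=1)  # forward pass

# Jacobian of the softmax output w.r.t. its input, shape [batch, classes, classes]
jacobian = tape.batch_jacobian(probs, signal)

# Summing over one axis reproduces what the NumPy code's J.sum(axis=1) returns
backward_signal = tf.reduce_sum(jacobian, axis=1)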

Upvotes: 2

naveenpitchai
naveenpitchai

Reputation: 98

I was recently in this exact situation and could not find much help, especially on extracting the diagonal elements to handle the i = j case.

One probable reason could be that tensorflow does not yet have numpy-like versatility.

See if the two alternatives below work for you:

  1. Implement the derivative part in numpy and dynamically call the function inside the session through a feed_dict.

Define a placeholder in your graph:

derivative = tf.placeholder(tf.float32,[None, num_features])

Assuming your softmax probabilities are stored in a variable named 'output',

then in your session you could do this:

session.run(train, feed_dict={derivative: softmax_function(output.eval(), derivative=True)})

Note: This method could be computationally expensive.

  2. Understanding the maths behind that particular derivative

Loss function,

      L = -(1/N) ∑_k y_k · log(p_k)

As per the chain rule,

      ∂L/∂W2 = ∂L/∂p_i · ∂p_i/∂a2 · ∂a2/∂W2

where L is the loss function, p is the probability output, a2 is the pre-softmax output (logits) of the last layer, and W2 its weight matrix.

Calculating the first two terms in the chain rule for the two cases, k = i and k != i, simplifies the expression to the term below:

   ∂L/∂p_i · ∂p_i/∂a2 = p_i - y_i  -->  the delta value for the output layer.

So you can calculate your l2_delta simply by subtracting your target from the softmax output of the forward propagation, instead of using the complex Jacobian matrix method (see the sketch at the end of this answer).

Check out these links to learn more about the math behind this,

An excellent explanation of why the softmax derivative is not as simple as that of other activations --> Derivative of a softmax function

A PDF explaining the simplification steps described in this answer: PDF
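
To make the delta calculation concrete, here is a minimal sketch of computing l2_delta by subtraction instead of building the Jacobian. The example values and the one-hot encoding variable y_onehot are mine, purely for illustration:

import numpy as np

# probs: softmax output from the forward pass, shape (batch, classes)
# y:     integer class labels, shape (batch,)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6]])
y = np.array([0, 2])

# One-hot encode the targets
y_onehot = np.zeros_like(probs)
y_onehot[np.arange(len(y)), y] = 1.0

# Delta of the output layer: p - y
l2_delta = probs - y_onehot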

Upvotes: 2
