Reputation: 45
I'm currently writing my first multilayer neural net with Python 3.7 and NumPy, and I'm having trouble implementing softmax (I intend to use my network for classification, so having a working implementation of softmax is pretty crucial). I copied this code from a different thread:
def softmax(x):
    return exp(x) / np.sum(exp(x), axis=0)
I think I have a basic understanding of the intended function of the softmax function; that is, to take a vector and turn its elements into probabilities so that they sum to 1. Please correct my understanding if I'm wrong. I don't quite understand how this code accomplishes that function, but I found similar code on multiple other threads, so I believe it to be correct. Please confirm.
Unfortunately, in none of these threads could I find a clear implementation of the derivative of the softmax function. I understand it to be more complicated than that of most activation functions, and to require more parameters than just x, but I have no idea how to implement it myself. I'm looking for an explanation of what those other parameters are, as well as for an implementation (or mathematical expression) of the derivative of the softmax function.
Upvotes: 0
Views: 586
Reputation: 3764
Answer for how this code accomplishes that function:
Here, we make use of a concept known as broadcasting.
When you use the function exp(x), then, assuming x is a vector, you actually perform an operation similar to what can be accomplished by the following code:
# Body of a function that applies exp to each element of x in turn
exps = []
for i in x:
    exps.append(exp(i))
return exps
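The above code is the longer version of what NumPy does automatically here: exp is applied to every element of x, and the division by the scalar sum is then broadcast across all of those elements.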
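To make this concrete, here is a minimal sketch, assuming the input is a 1-D sequence of numbers. The name softmax_stable and the subtraction of the maximum are additions that do not appear in the snippet from the question; the shift leaves the result unchanged but keeps exp from overflowing on large inputs.

```python
import numpy as np

def softmax_stable(x):
    # Hypothetical helper: same result as exp(x) / sum(exp(x)),
    # but shifted by max(x) so that exp never overflows.
    x = np.asarray(x, dtype=float)
    shifted = x - np.max(x, axis=0)
    exps = np.exp(shifted)                # exp applied element-wise
    return exps / np.sum(exps, axis=0)    # the sum is broadcast across exps

probs = softmax_stable([1.0, 2.0, 3.0])
print(probs)        # three positive values, largest for the largest input
print(probs.sum())  # sums to 1 up to floating-point rounding
```

Every output is positive because exp is positive, and dividing by the total forces the outputs to sum to 1, which is why they can be read as probabilities.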
As for the implementation of the derivative, that's a bit more complicated, as you say.
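For reference, the standard mathematical expression can be written as follows. The symbols s_i (the i-th softmax output) and \delta_{ij} (1 if i = j, otherwise 0) are notation introduced here, not taken from the question.

```latex
s_i = \frac{e^{x_i}}{\sum_k e^{x_k}}, \qquad
\frac{\partial s_i}{\partial x_j} = s_i \,(\delta_{ij} - s_j)
```

In other words, the derivative of an output with respect to its own input is s_i (1 - s_i), and with respect to any other input it is -s_i s_j. The extra index j is the "other parameter": every output depends on every input, so the full derivative is a matrix rather than a vector.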
An untested implementation that computes, for every element of X, the derivative of the corresponding softmax output with respect to that element (the diagonal terms only):
import numpy as np

def softmax_derivative(X):
    # input : a vector X
    # output : a list containing the derivative of softmax(X)[i] wrt X[i] for every i
    # List of derivatives
    derivs = []
    # denominator after differentiation (the quotient rule squares it below)
    denom = np.sum(np.exp(X), axis=0)
    for i, x in enumerate(X):
        # Common factor from the quotient rule; it is positive for these diagonal terms
        comm = np.exp(x) / (denom**2)
        factor = 0
        # Add exp of every element except the current one; compare by index,
        # not by value, so that duplicate values in X are not skipped
        for j, other in enumerate(X):
            if j == i:
                continue
            factor += np.exp(other)
        derivs.append(comm * factor)
    return derivs
You can also use broadcasting in the above function, but I think it's clearer in this manner.
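A version using array operations instead of explicit loops might look like the sketch below. It assumes a 1-D input, and the name softmax_jacobian is made up for illustration. Unlike a per-element loop, it returns the full matrix of partial derivatives, with entry [i, j] holding the derivative of output i with respect to input j.

```python
import numpy as np

def softmax_jacobian(x):
    # Hypothetical helper: J[i, j] = d softmax(x)[i] / d x[j]
    x = np.asarray(x, dtype=float)
    exps = np.exp(x - np.max(x))   # shift by the max for numerical stability
    s = exps / np.sum(exps)
    # np.diag(s) supplies s_i on the diagonal; np.outer(s, s) supplies s_i * s_j
    return np.diag(s) - np.outer(s, s)

J = softmax_jacobian([1.0, 2.0, 3.0])
print(np.diag(J))     # derivative of each output wrt its own input
print(J.sum(axis=0))  # each column sums to ~0, since the outputs always sum to 1
```

During backpropagation this matrix is multiplied with the gradient coming from the loss, so in practice you rarely need the diagonal on its own.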
Upvotes: 1