Amar Parajuli

Reputation: 45

Softmax function defined in Tensorflow Github repository

I am going through the GitHub source code for Softmax activation function. I have a few questions regarding the code.

  1. m = x.max(1)[:, np.newaxis] has been used to find the maximum in the array provided. What is the need for np.newaxis in this expression?
  2. u = np.exp(x - m) has been used but to my knowledge, it should have been u = np.exp(x). What implementational detail am I missing?
  3. z = u.sum(1)[:, np.newaxis]. Similar to earlier this code also uses np.newaxis. What is its use here?

For reference, here is the link to the GitHub repo where this function is defined.

Upvotes: 1

Views: 106

Answers (1)

nonin

Reputation: 724

The function under discussion is as follows:

def softmax(x):
    assert len(x.shape) == 2
    m = x.max(1)[:, np.newaxis]
    u = np.exp(x - m)
    z = u.sum(1)[:, np.newaxis]
    return u / z

As the assert statement suggests, this softmax function must be applied to a 2D array, and every row of the result u / z sums to one. That's why the methods max and sum are applied row-wise, i.e. with the parameter axis=1.

Broadcasting and np.newaxis

For each row x[i] of x, we want to compute np.exp(x[i]) / np.sum(np.exp(x[i])). Here the normalization term np.sum(np.exp(x[i])) is a number, while the term np.exp(x[i]) is a 1D array. Thanks to numpy's broadcasting rules, the operation can be performed.
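Written as an explicit loop over rows (a sketch for illustration, not the vectorized version from the repo), that per-row computation looks like this:

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 0.0]])

# compute softmax row by row: each np.sum(...) is a scalar,
# so broadcasting divides every entry of the 1D row by it
rows = [np.exp(x[i]) / np.sum(np.exp(x[i])) for i in range(x.shape[0])]
result = np.stack(rows)  # each row sums to 1
```

The rest of the answer shows how NumPy lets us drop this Python-level loop entirely.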

Now, iterating on the rows of x can be avoided thanks to numpy. Let's take as an example the following array for np.exp(x).

u = np.array([[ 9,  6, 13, 19,  8],
              [ 2, 17, 18,  0, 13],
              [ 8,  3,  2, 18, 10]])  # np.exp(x)
u.sum(axis=1)  # normalization term: array([55, 50, 41])

The aim is to divide each row of u by the corresponding value of the normalization term u.sum(axis=1). However, broadcasting rules do not allow the two terms to be divided directly, since u has shape (3, 5), while the normalization array has shape (3,). As numpy's documentation indicates:

Two dimensions are compatible when

  1. they are equal, or
  2. one of them is 1

So u can be divided by arrays of shape (3, 5), (1, 5), (3, 1), (5,) or () but not by u.sum(1) of shape (3,).

That's why the index operator newaxis is used to insert a new axis into the normalization term, making it two-dimensional with shape (3, 1).

u.sum(axis=1)[:, np.newaxis]  # array([[55], [50], [41]])
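With the extra axis in place, the division broadcasts as intended; continuing the example above:

```python
import numpy as np

u = np.array([[ 9,  6, 13, 19,  8],
              [ 2, 17, 18,  0, 13],
              [ 8,  3,  2, 18, 10]])

z = u.sum(axis=1)[:, np.newaxis]  # shape (3, 1): [[55], [50], [41]]
result = u / z                    # (3, 5) / (3, 1) broadcasts to (3, 5)
```

An equivalent way to keep the summed-over axis is u.sum(axis=1, keepdims=True), which produces the same (3, 1) shape without indexing.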

Finally a softmax function on rows would be

def softmax(x):
    assert x.ndim == 2
    u = np.exp(x)
    z = u.sum(axis=1)[:, np.newaxis]
    return u / z

Numeric stability

However, applying this function to large values can be numerically unstable, since np.exp(x) may overflow. Note that subtracting or adding any constant will not change the result, thanks to the normalization term:

For any constant c,

      exp(x_i - c) / sum_j exp(x_j - c)
    = (exp(-c) * exp(x_i)) / (exp(-c) * sum_j exp(x_j))
    = exp(x_i) / sum_j exp(x_j)

so adding or subtracting a constant has no effect on the result.

That's why the maxima of each row m are subtracted, so that all values are at most zero before applying the exponential function.
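To see the difference, here is a small comparison (the input values are made up to force overflow in the naive version):

```python
import numpy as np

x = np.array([[1000.0, 1001.0, 1002.0]])

# naive version: np.exp(1000.0) overflows to inf, and inf / inf is nan
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(x) / np.exp(x).sum(1)[:, np.newaxis]

# stable version: subtract the row maximum first, so the largest
# exponent is exp(0) = 1 and nothing overflows
m = x.max(1)[:, np.newaxis]
u = np.exp(x - m)
stable = u / u.sum(1)[:, np.newaxis]
```

The stable version returns finite probabilities that sum to one, while the naive version returns nan for every entry.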

Upvotes: 4
