Roman

Reputation: 131228

Why are the cost function and the last activation function bound together in MXNet?

When we define a deep learning model, we take the following steps:

  1. Specify how the output should be calculated based on the input and the model's parameters.
  2. Specify a cost (loss) function.
  3. Search for the model's parameters by minimizing the cost function.

It looks to me that in MXNet the first two steps are bound together. For example, this is how I define a linear transformation:

import mxnet as mx
import numpy as np

# declare a symbolic variable for the model's input
inp = mx.sym.Variable(name = 'inp')
# define how output should be determined by the input
out = mx.sym.FullyConnected(inp, name = 'out', num_hidden = 2)

# specify input and model's parameters
x = mx.nd.array(np.ones(shape = (5,3)))
w = mx.nd.array(np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
b = mx.nd.array(np.array([7.0, 8.0]))

# calculate output based on the input and parameters
p = out.bind(ctx = mx.cpu(), args = {'inp':x, 'out_weight':w, 'out_bias':b})
print(p.forward()[0].asnumpy())

Now, if I want to add a softmax transformation on top of it, I need to do the following:

# define the cost function
target = mx.sym.Variable(name = 'target')
cost = mx.sym.SoftmaxOutput(out, target, name = 'softmax')

y = mx.nd.array(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
c = cost.bind(ctx = mx.cpu(), args = {'inp':x, 'out_weight':w, 'out_bias':b, 'target':y})
print(c.forward()[0].asnumpy())

What I do not understand is why we need to create the symbolic variable target at all. We would need it only if we wanted to calculate the cost, but so far we are just calculating the output from the input (a linear transformation followed by softmax).

Moreover, we have to provide a numerical value for the target just to get the output calculated. So it looks like the target is required but not used: the provided value of the target does not change the value of the output.
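
To illustrate (the names y2 and c2 below are just for demonstration), binding the same symbol with a completely different target yields exactly the same forward output:

# bind with a different target; the forward output is unchanged,
# which suggests the target only matters for the backward pass
y2 = mx.nd.array(np.zeros(shape = (5,2)))
c2 = cost.bind(ctx = mx.cpu(), args = {'inp':x, 'out_weight':w, 'out_bias':b, 'target':y2})
print(c2.forward()[0].asnumpy())  # same values as before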

Finally, we can use the cost object to define a model that we can fit as soon as we have data. But what about the cost function? It has to be specified somewhere, yet I never specified it. Basically, it looks like I am forced to use a specific cost function just because I use softmax. But why?

ADDED

For a more statistical / mathematical point of view, check here. The current question, though, is more pragmatic / programmatic in nature. It is basically: how do I decouple the output nonlinearity and the cost function in MXNet? For example, I might want to do a linear transformation and then find the model's parameters by minimizing the absolute deviation instead of the squared one, as sketched below.
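
To make it concrete, here is a sketch of what I would like to be able to write (assuming mx.sym.MakeLoss can wrap an arbitrary symbolic expression as a loss; I am not sure this is the intended way):

# hypothetical sketch: keep the linear transformation from above
# and attach an L1 (absolute deviation) loss instead of a squared one
target = mx.sym.Variable(name = 'target')
l1_cost = mx.sym.MakeLoss(mx.sym.abs(out - target), name = 'l1')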

Upvotes: 2

Views: 123

Answers (1)

Sina Afrooze

Reputation: 980

You can use mx.sym.softmax() if you only want softmax. mx.sym.SoftmaxOutput() contains efficient code for calculating the gradient of cross-entropy (negative log loss), which is the most common loss used with softmax. If you want to use your own loss, just use softmax and add a loss on top during training. I should also note that you can replace the SoftmaxOutput layer with a simple softmax during inference if you really want to.
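
For example, a minimal sketch of that pattern (using MakeLoss to wrap a hand-written cross-entropy; the variable names are illustrative):

# softmax alone, with no target attached, for inference
prob = mx.sym.softmax(out, name = 'prob')

# during training, add your own loss on top, e.g. cross-entropy
target = mx.sym.Variable(name = 'target')
ce = -mx.sym.sum(target * mx.sym.log(prob + 1e-12), axis = 1)
loss = mx.sym.MakeLoss(ce, name = 'ce_loss')

With loss as the training symbol the gradient flows through your own loss expression, and at inference time you bind prob instead, so the output nonlinearity and the cost are decoupled.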

Upvotes: 3
