rwallace

Reputation: 33395

Why is the code for a neural network with a sigmoid so different from the code with softmax_cross_entropy_with_logits?

When using neural networks for classification, the usual advice is to use a softmax output with a cross-entropy loss for multi-class problems and a sigmoid output for binary classification.

The way to calculate softmax cross entropy in TensorFlow seems to be along the lines of:

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))

So the output can be connected directly to the minimization code, which is good.
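
For example, the full hookup looks something like this (filling in the rest of the pipeline just for illustration; the placeholder shapes and the choice of Adam are arbitrary, and prediction is assumed to hold raw logits):

import tensorflow as tf

# Example shapes only: raw logits and one-hot labels for 10 classes.
prediction = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 10])

# The softmax is applied inside the loss op, not as a separate layer.
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))

# The cost feeds straight into the optimizer.
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)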

The code I have for sigmoid output, likewise based on various tutorials and examples, is along the lines of:

p = tf.sigmoid(tf.squeeze(...))
cost = tf.reduce_mean((p - y)**2)

I would have thought the two should be similar in form, since they are doing the same job in almost the same way, but the code fragments above look almost completely different. Furthermore, the sigmoid version explicitly squares the error whereas the softmax version doesn't. (Is the squaring happening somewhere inside the implementation of softmax, or is something else going on?)

Is one of the above simply incorrect, or is there a reason why they need to be completely different?

Upvotes: 2

Views: 998

Answers (1)

mr_mo

Reputation: 1528

The softmax cross-entropy cost and the squared loss on a sigmoid output are completely different cost functions. Though they may seem closely related, they are not the same thing.
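
To see this concretely, for a single example with logits $z$ and target $y$, the two costs are (writing $\sigma$ for the sigmoid; this notation is added here just for illustration):

$$
L_{\text{CE}}(z, y) = -\sum_i y_i \log \frac{e^{z_i}}{\sum_j e^{z_j}},
\qquad
L_{\text{L2}}(z, y) = \bigl(\sigma(z) - y\bigr)^2 .
$$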

It is true that both functions "do the same job" if the job is defined as "be the output layer of a classification network". In the same sense, you could say that "softmax regression and neural networks do the same job": both try to classify things, but in different ways.

A softmax layer with a cross-entropy cost is usually preferred over sigmoids with an L2 loss. Softmax with cross-entropy has its own advantages, such as a stronger gradient at the output layer and outputs that are normalized to a probability vector, whereas the gradients of sigmoids with an L2 loss are weaker. You can find plenty of explanations in this beautiful book.
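
The "stronger gradient" point can be seen from the standard derivatives with respect to the logits, sketched here for a single example:

$$
\frac{\partial L_{\text{CE}}}{\partial z_i} = \mathrm{softmax}(z)_i - y_i,
\qquad
\frac{\partial L_{\text{L2}}}{\partial z} = 2\,\bigl(\sigma(z) - y\bigr)\,\sigma(z)\,\bigl(1 - \sigma(z)\bigr).
$$

The extra $\sigma(z)\,(1 - \sigma(z))$ factor goes to zero when the sigmoid saturates, so learning slows down even when the prediction is badly wrong; the cross-entropy gradient has no such factor.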

Upvotes: 1
