Reputation: 1936
When I use Keras's binary_crossentropy as the loss function (which calls TensorFlow's sigmoid_cross_entropy), it seems to produce loss values only between [0, 1]. However, the equation itself
# The logistic loss formula from above is
# x - x * z + log(1 + exp(-x))
# For x < 0, a more numerically stable formula is
# -x * z + log(1 + exp(x))
# Note that these two expressions can be combined into the following:
# max(x, 0) - x * z + log(1 + exp(-abs(x)))
# To allow computing gradients at zero, we define custom versions of max and
# abs functions.
zeros = array_ops.zeros_like(logits, dtype=logits.dtype)
cond = (logits >= zeros)
relu_logits = array_ops.where(cond, logits, zeros)
neg_abs_logits = array_ops.where(cond, -logits, logits)
return math_ops.add(
    relu_logits - logits * labels,
    math_ops.log1p(math_ops.exp(neg_abs_logits)),
    name=name)
implies that the range is [0, infinity). So is TensorFlow doing some sort of clipping that I'm not catching? Moreover, since it's doing math_ops.add(), I'd assume the result could certainly be greater than 1. Am I right to assume that the loss range can definitely exceed 1?
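For concreteness, here is a minimal NumPy sketch of the quoted formula (the logits and labels are made up for illustration); a confident but wrong prediction already gives a value well above 1:

import numpy as np

def logistic_loss(logits, labels):
    # max(x, 0) - x * z + log(1 + exp(-abs(x))), as in the snippet above
    x = np.asarray(logits, dtype=float)
    z = np.asarray(labels, dtype=float)
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

print(logistic_loss(5.0, 0.0))  # ~5.007: confident and wrong, well above 1
print(logistic_loss(5.0, 1.0))  # ~0.007: confident and correct, close to 0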
Upvotes: 6
Views: 1889
Reputation: 10474
The cross entropy function is indeed not bounded above. However, it will only take on large values if the predictions are very wrong. Let's first look at the behavior of a randomly initialized network.
With random weights, the many units/layers will usually compound such that the network outputs approximately uniform predictions. That is, in a classification problem with n classes you will get probabilities of around 1/n for each class (0.5 in the two-class case). In this case, the cross entropy will be around the entropy of an n-class uniform distribution, which is log(n), under certain assumptions (see below).
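As a quick empirical check, a small untrained softmax classifier evaluated on random data gives an initial loss close to log(n). This is only a minimal sketch assuming tf.keras; the layer sizes and data are made up:

import numpy as np
import tensorflow as tf

n_classes = 10
model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(loss="sparse_categorical_crossentropy")

x = np.random.randn(256, 20).astype("float32")  # made-up features
y = np.random.randint(0, n_classes, size=256)   # made-up labels
print(model.evaluate(x, y, verbose=0))          # typically close to log(10)
print(np.log(n_classes))                        # ~2.303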
This can be seen as follows: the cross entropy for a single data point is -sum(p(k)*log(q(k))), where p are the true probabilities (labels), q are the predictions, k indexes the classes, and the sum runs over the classes. Now, with hard labels (i.e. one-hot encoded), only a single p(k) is 1 and all others are 0. Thus, the term reduces to -log(q(k)), where k is now the correct class. With a randomly initialized network, q(k) ~ 1/n, so we get -log(1/n) = log(n).
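A worked example of this reduction (plain NumPy, with a made-up 4-class problem):

import numpy as np

n = 4
p = np.array([0.0, 1.0, 0.0, 0.0])  # hard (one-hot) label: class 1 is correct
q = np.full(n, 1.0 / n)             # roughly uniform prediction from a random net

cross_entropy = -np.sum(p * np.log(q))
print(cross_entropy)                 # 1.386..., i.e. log(4)
print(-np.log(q[1]), np.log(n))      # same value: -log(1/n) = log(n)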
We can also go off the definition of the cross entropy, which is generally entropy(p) + Kullback-Leibler divergence(p, q). If p and q are the same distribution (e.g. p is uniform when we have the same number of examples for each class, and q is around uniform for random networks), then the KL divergence becomes 0 and we are left with entropy(p).
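The decomposition is easy to verify numerically; here is a minimal sketch with made-up soft distributions p and q:

import numpy as np

p = np.array([0.5, 0.3, 0.2])  # "true" distribution
q = np.array([0.4, 0.4, 0.2])  # predicted distribution

cross_entropy = -np.sum(p * np.log(q))
entropy = -np.sum(p * np.log(p))
kl_divergence = np.sum(p * np.log(p / q))
print(cross_entropy, entropy + kl_divergence)  # equal up to floating point error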
Now, since the training objective is usually to reduce the cross entropy, we can think of log(n) as a kind of worst-case value. If it ever gets higher, there is probably something wrong with your model. Since it looks like you only have two classes (0 and 1), log(2) ≈ 0.693 < 1, so your cross entropy will generally be quite small.
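To tie this back to your question: with tf.keras.losses.binary_crossentropy (the predicted probabilities below are made up), an undecided prediction gives about log(2), and the loss only climbs well above 1 when the prediction is confidently wrong:

import tensorflow as tf

y_true = tf.constant([[1.0], [1.0]])
y_pred = tf.constant([[0.5],    # undecided prediction
                      [0.01]])  # confidently wrong prediction
print(tf.keras.losses.binary_crossentropy(y_true, y_pred).numpy())
# -> roughly [0.693, 4.605], i.e. log(2) and -log(0.01)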
Upvotes: 4