Reputation: 21632
Does it make sense to use non-binary ground truth values for binary cross-entropy? Is there any formal proof?
It looks like this is done in practice: for example in https://blog.keras.io/building-autoencoders-in-keras.html, the MNIST images are not binary but grayscale.
Here are some code examples:
1. Normal case:
import numpy as np
import keras
from keras import backend as K  # imports used by all examples below

def test_1():
    print('-' * 60)
    y_pred = np.array([0.5, 0.5])
    y_pred = np.expand_dims(y_pred, axis=0)
    y_true = np.array([0.0, 1.0])
    y_true = np.expand_dims(y_true, axis=0)
    loss = keras.losses.binary_crossentropy(
        K.variable(y_true),
        K.variable(y_pred)
    )
    print("K.eval(loss):", K.eval(loss))
Output:
K.eval(loss): [0.6931472]
2. Non-binary ground truth values case:
def test_2():
    print('-' * 60)
    y_pred = np.array([0.0, 1.0])
    y_pred = np.expand_dims(y_pred, axis=0)
    y_true = np.array([0.5, 0.5])
    y_true = np.expand_dims(y_true, axis=0)
    loss = keras.losses.binary_crossentropy(
        K.variable(y_true),
        K.variable(y_pred)
    )
    print("K.eval(loss):", K.eval(loss))
Output:
K.eval(loss): [8.01512]
3. Ground truth values outside the [0, 1] range:
def test_3():
    print('-' * 60)
    y_pred = np.array([0.5, 0.5])
    y_pred = np.expand_dims(y_pred, axis=0)
    y_true = np.array([-2.0, 2.0])
    y_true = np.expand_dims(y_true, axis=0)
    loss = keras.losses.binary_crossentropy(
        K.variable(y_true),
        K.variable(y_pred)
    )
    print("K.eval(loss):", K.eval(loss))
Output:
K.eval(loss): [0.6931472]
For some reason the loss in test_1 and test_3 is the same. Maybe [-2, 2] is being clipped to [0, 1], but I can't see any clipping of the targets in the Keras code.
It's also interesting that the loss differs so much between test_1 and test_2: in the 1st case the tensors are [0.5, 0.5] and [0.0, 1.0], and in the 2nd case they are [0.0, 1.0] and [0.5, 0.5], i.e. the same values with prediction and target swapped.
In Keras, binary_crossentropy is defined as:
def binary_crossentropy(y_true, y_pred):
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)
which calls the TensorFlow backend's K.binary_crossentropy:
def binary_crossentropy(target, output, from_logits=False):
    """Binary crossentropy between an output tensor and a target tensor.

    # Arguments
        target: A tensor with the same shape as `output`.
        output: A tensor.
        from_logits: Whether `output` is expected to be a logits tensor.
            By default, we consider that `output`
            encodes a probability distribution.

    # Returns
        A tensor.
    """
    # Note: tf.nn.sigmoid_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # transform back to logits
        _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
        output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
        output = tf.log(output / (1 - output))
    return tf.nn.sigmoid_cross_entropy_with_logits(labels=target,
                                                   logits=output)
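To double-check the numbers outside of Keras, here is a minimal NumPy sketch of the same formula (the function name and _EPSILON constant are mine; I'm assuming the default K.epsilon() of 1e-7, and note that only the prediction is clipped, not the target):
import numpy as np

_EPSILON = 1e-7  # assumed value of K.epsilon()

def bce_numpy(y_true, y_pred):
    # Same formula written out directly: clip only y_pred, use y_true as-is.
    y_pred = np.clip(y_pred, _EPSILON, 1 - _EPSILON)
    return np.mean(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)), axis=-1)

print(bce_numpy(np.array([[0.0, 1.0]]), np.array([[0.5, 0.5]])))   # test_1: ~0.6931472
print(bce_numpy(np.array([[0.5, 0.5]]), np.array([[0.0, 1.0]])))   # test_2: ~8.059 (close to 8.01512;
                                                                   # presumably Keras differs slightly because
                                                                   # it round-trips through float32 logits)
print(bce_numpy(np.array([[-2.0, 2.0]]), np.array([[0.5, 0.5]])))  # test_3: ~0.6931472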
Upvotes: 3
Views: 1463
Reputation: 10474
Yes, it "makes sense" in that the cross-entropy is a measure for the difference between probability distributions. That is, any distributions (over the same sample space of course) -- the case where the target distribution is one-hot is really just a special case, despite how often it is used in machine learning.
In general, if p is your true distribution and q is your model, cross-entropy is minimized for q = p. As such, using cross-entropy as a loss will encourage the model to converge towards the target distribution.
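A quick way to see this for the binary case: fix a soft target t and treat the per-element loss -(t*log(q) + (1-t)*log(1-q)) as a function of the prediction q; it is minimized exactly at q = t. A small NumPy sketch (the function name is mine):
import numpy as np

def bce(t, q):
    # per-element binary cross-entropy: -(t*log(q) + (1-t)*log(1-q))
    return -(t * np.log(q) + (1 - t) * np.log(1 - q))

t = 0.3                            # a non-binary "soft" target
qs = np.linspace(0.01, 0.99, 99)   # candidate predictions
print(qs[np.argmin(bce(t, qs))])   # ~0.3 -- the loss is smallest at q = t
So training with soft targets pushes the predicted probability towards the soft label itself, which is what you want for grayscale pixel intensities.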
As for the difference between cases 1 and 2: Cross-entropy is not symmetric. It is actually equal to the entropy of the true distribution p plus the KL-divergence between p and q. This implies that it will generally be larger for p closer to uniform (less "one-hot"), because such distributions have higher entropy (I suppose the KL-divergence will also be different, since it's not symmetric).
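To make that concrete, here is a small sketch (helper names are mine; I apply the same 1e-7 clipping Keras uses for the prediction to the target as well, so the entropy term stays finite) that splits the loss of cases 1 and 2 into entropy plus KL:
import numpy as np

EPS = 1e-7  # same clipping constant Keras applies to the prediction

def entropy(t):
    t = np.clip(t, EPS, 1 - EPS)
    return -(t * np.log(t) + (1 - t) * np.log(1 - t))

def kl(t, q):
    t = np.clip(t, EPS, 1 - EPS)
    q = np.clip(q, EPS, 1 - EPS)
    return t * np.log(t / q) + (1 - t) * np.log((1 - t) / (1 - q))

def ce(t, q):
    # cross-entropy = entropy(target) + KL(target || prediction)
    return np.mean(entropy(t) + kl(t, q))

# case 1: one-hot target, uniform prediction -> the entropy term is ~0
print(ce(np.array([0.0, 1.0]), np.array([0.5, 0.5])))  # ~0.693
# case 2: uniform target, (clipped) one-hot prediction -> the entropy term alone is already ~0.693
print(ce(np.array([0.5, 0.5]), np.array([0.0, 1.0])))  # ~8.06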
As for case 3: This is actually an artifact of using 0.5 as output. It turns out that in the cross-entropy formula, terms cancel out in exactly such a way that you always get the same result (log(2)) no matter the labels. This changes when you use an output != 0.5; in that case, different labels give you different cross-entropies. For example:
output 0.3, target 2.0 gives a cross-entropy of 2.0512707
output 0.3, target -2.0 gives a cross-entropy of -1.3379208
The second case actually gives a negative value, which makes no sense. IMHO the fact that the function allows targets outside the range [0, 1] is an oversight; this should result in a crash. The cross-entropy formula works just fine, but the results are meaningless.
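Reproducing those numbers with a plain NumPy version of the formula (again just a sketch, names are mine):
import numpy as np

def bce(t, q):
    # the formula happily accepts any target value
    return -(t * np.log(q) + (1 - t) * np.log(1 - q))

print(bce( 2.0, 0.3))  #  2.0512707...
print(bce(-2.0, 0.3))  # -1.3379208...  negative, i.e. meaningless as a loss
print(bce( 2.0, 0.5))  #  0.6931472 -- at output 0.5 the target cancels out
print(bce(-2.0, 0.5))  #  0.6931472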
I would also recommend reading the Wikipedia article on cross-entropy. It's quite short and has some useful information.
Upvotes: 2