mrgloom

Reputation: 21632

Non-binary ground truth labels in binary crossentropy?

Does it make sense to use non-binary ground truth values with binary crossentropy? Is there any formal proof?

It looks like this is used in practice: for example, in https://blog.keras.io/building-autoencoders-in-keras.html the MNIST images are not binary but grayscale.

Here are some code examples:

1. Normal case:

import numpy as np
import keras
from keras import backend as K

def test_1():
    print('-'*60)

    y_pred = np.array([0.5, 0.5])
    y_pred = np.expand_dims(y_pred, axis=0)
    y_true = np.array([0.0, 1.0])
    y_true = np.expand_dims(y_true, axis=0)

    loss = keras.losses.binary_crossentropy(
        K.variable(y_true),
        K.variable(y_pred)
    )

    print("K.eval(loss):", K.eval(loss))

Output:

K.eval(loss): [0.6931472]

2. Non-binary ground truth values:

def test_2():
    print('-'*60)

    y_pred = np.array([0.0, 1.0])
    y_pred = np.expand_dims(y_pred, axis=0)
    y_true = np.array([0.5, 0.5])
    y_true = np.expand_dims(y_true, axis=0)

    loss = keras.losses.binary_crossentropy(
        K.variable(y_true),
        K.variable(y_pred)
    )

    print("K.eval(loss):", K.eval(loss))

Output:

K.eval(loss): [8.01512]

3. Ground truth values outside the [0, 1] range:

def test_3():
    print('-'*60)

    y_pred = np.array([0.5, 0.5])
    y_pred = np.expand_dims(y_pred, axis=0)
    y_true = np.array([-2.0, 2.0])
    y_true = np.expand_dims(y_true, axis=0)

    loss = keras.losses.binary_crossentropy(
        K.variable(y_true),
        K.variable(y_pred)
    )

    print("K.eval(loss):", K.eval(loss))

Output:

K.eval(loss): [0.6931472]

For some reason the loss in test_1 and test_3 is the same; maybe that's because [-2, 2] gets clipped to [0, 1], but I can't find any code in Keras that clips the targets. It's also interesting that the loss values in test_1 and test_2 differ so much, even though the 1st case uses [0.5, 0.5] and [0.0, 1.0] and the 2nd case uses [0.0, 1.0] and [0.5, 0.5], i.e. the same values with y_pred and y_true swapped.

In Keras, binary_crossentropy is defined as follows (the loss function and the backend implementation it calls):

def binary_crossentropy(y_true, y_pred):
    return K.mean(K.binary_crossentropy(y_true, y_pred), axis=-1)


def binary_crossentropy(target, output, from_logits=False):
    """Binary crossentropy between an output tensor and a target tensor.

    # Arguments
        target: A tensor with the same shape as `output`.
        output: A tensor.
        from_logits: Whether `output` is expected to be a logits tensor.
            By default, we consider that `output`
            encodes a probability distribution.

    # Returns
        A tensor.
    """
    # Note: tf.nn.sigmoid_cross_entropy_with_logits
    # expects logits, Keras expects probabilities.
    if not from_logits:
        # transform back to logits
        _epsilon = _to_tensor(epsilon(), output.dtype.base_dtype)
        output = tf.clip_by_value(output, _epsilon, 1 - _epsilon)
        output = tf.log(output / (1 - output))

    return tf.nn.sigmoid_cross_entropy_with_logits(labels=target,
                                                   logits=output)
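
For reference, here is a rough NumPy sketch of the same computation. This is not the exact Keras code path (Keras goes back through logits in float32), so the middle value comes out as ~8.06 rather than exactly 8.01512:

import numpy as np

def manual_binary_crossentropy(y_true, y_pred, eps=1e-7):
    # mirror Keras: clip the predictions to [eps, 1 - eps] before taking logs
    y_pred = np.clip(y_pred, eps, 1 - eps)
    ce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.mean(ce, axis=-1)

print(manual_binary_crossentropy(np.array([[0.0, 1.0]]), np.array([[0.5, 0.5]])))   # ~0.6931, matches test_1
print(manual_binary_crossentropy(np.array([[0.5, 0.5]]), np.array([[0.0, 1.0]])))   # ~8.06, close to test_2
print(manual_binary_crossentropy(np.array([[-2.0, 2.0]]), np.array([[0.5, 0.5]])))  # ~0.6931, matches test_3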

Upvotes: 3

Views: 1463

Answers (1)

xdurch0

Reputation: 10474

Yes, it "makes sense" in that the cross-entropy is a measure for the difference between probability distributions. That is, any distributions (over the same sample space of course) -- the case where the target distribution is one-hot is really just a special case, despite how often it is used in machine learning.

In general, if p is your true distribution and q is your model, cross-entropy is minimized for q = p. As such, using cross-entropy as a loss will encourage the model to converge towards the target distribution.
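
As a quick sanity check (a small NumPy sketch of the per-output Bernoulli cross-entropy formula, not Keras itself), scanning candidate outputs for a fixed non-binary target shows the minimum sits exactly at q = p:

import numpy as np

t = 0.3                               # a non-binary target
qs = np.linspace(0.01, 0.99, 99)      # candidate model outputs
ce = -(t * np.log(qs) + (1 - t) * np.log(1 - qs))
print(qs[np.argmin(ce)])              # ~0.3: cross-entropy is minimized where q equals the target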

As for the difference between cases 1 and 2: Cross-entropy is not symmetric. It is actually equal to the entropy of the true distribution p plus the KL-divergence between p and q. This implies that it will generally be larger for p closer to uniform (less "one-hot") because such distributions have higher entropy (I suppose the KL-divergence will also be different since it's not symmetric).
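
You can check this decomposition numerically per output unit, treating each unit as a Bernoulli distribution. The sketch below uses a hand-rolled formula with an epsilon clip similar to what Keras applies, not Keras itself:

import numpy as np

def bernoulli_ce(t, p):
    return -(t * np.log(p) + (1 - t) * np.log(1 - p))

def bernoulli_entropy(t):
    # 0 * log(0) is treated as 0
    return -sum(x * np.log(x) for x in (t, 1 - t) if x > 0)

def bernoulli_kl(t, p):
    return sum(x * np.log(x / y) for x, y in ((t, p), (1 - t, 1 - p)) if x > 0)

eps = 1e-7

# Case 1: one-hot target 1.0, prediction 0.5 -> entropy is 0, cross-entropy equals the KL term
t, p = 1.0, 0.5
print(bernoulli_ce(t, p), bernoulli_entropy(t) + bernoulli_kl(t, p))   # both ~0.693

# Case 2: uniform target 0.5, prediction clipped near 1.0 -> entropy log(2) plus a large KL term
t, p = 0.5, 1 - eps
print(bernoulli_ce(t, p), bernoulli_entropy(t) + bernoulli_kl(t, p))   # both ~8.06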

As for case 3: This is actually an artifact of using 0.5 as output. It turns out that in the cross-entropy formula, terms will cancel out in exactly such a way that you always get the same result (log(2)) no matter the labels. This will change when you use an output != 0.5; in this case, different labels give you different cross-entropies. For example:

  • output 0.3, target 2.0 gives cross-entropy of 2.0512707
  • output 0.3, target -2.0 gives cross-entropy of -1.3379208

The second case actually gives a negative cross-entropy, which makes no sense. IMHO the fact that the function allows targets outside the [0, 1] range is an oversight; this should result in a crash. The cross-entropy formula still evaluates without error, but the results are meaningless.
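
Here is a quick NumPy illustration of both points, using the bare per-element formula without the epsilon clipping Keras applies:

import numpy as np

def ce(t, q):
    # per-element binary cross-entropy: -(t*log(q) + (1-t)*log(1-q))
    return -(t * np.log(q) + (1 - t) * np.log(1 - q))

# With output q = 0.5, log(q) == log(1-q), so the target terms cancel and the
# result is always log(2) ~ 0.6931, whatever the label:
for t in (-2.0, 0.0, 0.5, 1.0, 2.0):
    print(t, ce(t, 0.5))

# With output q = 0.3 the label matters, and out-of-range labels give nonsense:
print(ce(2.0, 0.3))    # ~2.0512707
print(ce(-2.0, 0.3))   # ~-1.3379208 (negative, i.e. meaningless)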

I would also recommend reading the Wikipedia article on cross-entropy. It's quite short and has some useful information.

Upvotes: 2
