Reputation: 22234
Why does the numeric gradient (f(x+k)-f(x-k)) / 2k of the logistic log loss function f(x) = -np.log(1.0 - __sigmoid(x)) diverge, while -np.log(__sigmoid(x)) does not? What are the potential causes and mechanisms, or am I making a mistake? The code is at the bottom.
Any suggestions, corrections, insights, references, or advice/tips/hints on how to implement the numeric gradient will be appreciated.
I am trying to implement the numeric gradient (f(x+k)-f(x-k)) / 2k of the logistic log loss function. In the figure, y is the binary true/false label T and p is the activation sigmoid(x).
When k is relatively large, such as 1e-5, the issue does not happen, at least in the plotted range of x. However, when k gets smaller, e.g. 1e-8, -np.log(1.0 - __sigmoid(x)) starts diverging, while -np.log(__sigmoid(x)) does not.
I wonder whether the subtraction in 1.0 - sigmoid(x) has something to do with it, given how floating-point numbers are stored and computed in binary; a quick check of this suspicion is sketched below.
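As a quick check of that suspicion (a small sketch, separate from the full code at the bottom), compare 1.0 - sigmoid(x) with the algebraically equivalent exp(-x) / (1 + exp(-x)), which avoids the subtraction near 1:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (5.0, 10.0, 15.0, 20.0):
    naive = 1.0 - sigmoid(x)                    # subtraction of two nearly equal numbers
    stable = np.exp(-x) / (1.0 + np.exp(-x))    # same quantity without the cancellation
    # the relative disagreement grows with x as the cancellation destroys significant bits
    print(x, naive, stable, abs(naive - stable) / stable)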
The reason for trying to make k smaller: I prevent log(0) from becoming np.inf by adding a small number u, e.g. 1e-5, but log(x + 1e-5) causes the numerical gradient to deviate from the analytical one. To minimize that impact, I try to make it as small as possible, and that is when this issue started.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def __sigmoid(X):
return 1 / (1 + np.exp(-1 * X))
def __logistic_log_loss(X: np.ndarray, T: np.ndarray):
    # Binary cross-entropy on the logit X with label T, where p = sigmoid(X)
    return -(T * np.log(__sigmoid(X)) + (1-T) * np.log(1-__sigmoid(X)))
def __logistic_log_loss_gradient(X, T):
    # Analytical gradient dL/dX = sigmoid(X) - T
    Z = __sigmoid(X)
    return Z-T
N = 1000
left=-20
right=20
X = np.linspace(left,right,N)
T0 = np.zeros(N)
T1 = np.ones(N)
# --------------------------------------------------------------------------------
# T = 1
# --------------------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(
X,
__logistic_log_loss(X, T1),
color='blue', linestyle='solid',
label="logistic_log_loss(X, T=1)"
)
ax.plot(
X,
__logistic_log_loss_gradient(X, T1),
color='navy', linestyle='dashed',
label="Analytical gradient(T=1)"
)
# --------------------------------------------------------------------------------
# T = 0
# --------------------------------------------------------------------------------
ax.plot(
X,
__logistic_log_loss(X, T0),
color='magenta', linestyle='solid',
label="logistic_log_loss(X, T=0)"
)
ax.plot(
X,
__logistic_log_loss_gradient(X, T0),
color='purple', linestyle='dashed',
label="Analytical gradient(T=0)"
)
ax.set_xlabel("X")
ax.set_ylabel("dL/dX")
ax.set_title("Logistic log loss and gradient")
ax.legend()
ax.grid(True)
def t_0_loss(X):
return [
#logistic_log_loss(P=sigmoid(x), T=0)
-np.log(1.0 - __sigmoid(x)) for x in X
]
def t_1_loss(X):
return [
#logistic_log_loss(P=sigmoid(x), T=1)
-np.log(__sigmoid(x)) for x in X
]
N = 1000
left=-1
right=15
# Numerical gradient
# (f(x+k)-f(x-k)) / 2k
k = 1e-9
X = np.linspace(left,right,N)
fig, axes = plt.subplots(1, 2, figsize=(10,8))
# --------------------------------------------------------------------------------
# T = 0
# --------------------------------------------------------------------------------
axes[0].plot(
X,
((np.array(t_0_loss(X + k)) - np.array(t_0_loss(X - k))) / (2*k)),
color='red', linestyle='solid',
label="Diffed numerical gradient(T=0)"
)
axes[0].plot(
X[0:-1:20],
((np.array(t_0_loss(X + k)) - np.array(t_0_loss(X))) / k)[0:-1:20],
color='black', linestyle='dotted', marker='x', markersize=4,
label="Left numerical gradient(T=0)"
)
axes[0].plot(
X[0:-1:20],
((np.array(t_0_loss(X)) - np.array(t_0_loss(X - k))) / k)[0:-1:20],
color='salmon', linestyle='dotted', marker='o', markersize=5,
label="Right numerical gradient(T=0)"
)
axes[0].set_xlabel("X")
axes[0].set_ylabel("dL/dX")
axes[0].set_title("T=0: -log(1-sigmoid(x))")
axes[0].legend()
axes[0].grid(True)
# --------------------------------------------------------------------------------
# T = 1
# --------------------------------------------------------------------------------
axes[1].plot(
X,
((np.array(t_1_loss(X + k)) - np.array(t_1_loss(X - k))) / (2*k)),
color='blue', linestyle='solid',
label="Diffed numerical gradient(T=1)"
)
axes[1].plot(
X[0:-1:20],
((np.array(t_1_loss(X + k)) - np.array(t_1_loss(X))) / k)[0:-1:20],
color='cyan', linestyle='dashed', marker='x', markersize=5,
label="Left numerical gradient(T=1)"
)
axes[1].plot(
X[0:-1:20],
((np.array(t_1_loss(X)) - np.array(t_1_loss(X - k))) / k)[0:-1:20],
color='yellow', linestyle='dotted', marker='o', markersize=5,
label="Right numerical gradient(T=1)"
)
axes[1].set_xlabel("X")
axes[1].set_ylabel("dL/dX")
axes[1].set_title("T=1: -log(sigmoid(x)")
axes[1].legend()
axes[1].grid(True)
Upvotes: 2
Views: 638
Reputation: 22234
Solution by Reza.B.
Let z = 1/(1+p), p = e^(-x). Then 1 - z = p/(1+p), so log(1-z) = log(p) - log(1+p) = -x - log(1+e^(-x)), which is more stable in terms of rounding errors (we got rid of computing 1 - sigmoid(x), a subtraction of two nearly equal numbers, which is the main source of the numerical instability). Equivalently, -log(1-z) = x + log(1+e^(-x)) and -log(z) = log(1+e^(-x)), which is what the code below uses.
With this formulation, the divergence is resolved.
def t_0_loss(X):
    # -log(1 - sigmoid(X)) rewritten as X + log(1 + exp(-X))
    L = X + np.log(1 + np.exp(-X))
    return L.tolist()
def t_1_loss(X):
    # -log(sigmoid(X)) rewritten as log(1 + exp(-X))
    L = np.log(1 + np.exp(-X))
    return L.tolist()
%%timeit
((np.array(t_0_loss(X + k)) - np.array(t_0_loss(X - k))) / (2*k))
((np.array(t_1_loss(X + k)) - np.array(t_1_loss(X - k))) / (2*k))
---
599 µs ± 65.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The previous, numerically unstable version was:
47 ms ± 617 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
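As a side note (a suggestion beyond the original answer), the same reformulation can be written with np.logaddexp(a, b) = log(exp(a) + exp(b)), which also avoids overflow of np.exp(-X) when X is a large negative number; a minimal sketch:
import numpy as np

def t_0_loss(X):
    # -log(1 - sigmoid(X)) == log(1 + exp(X)), evaluated without an explicit exp
    return np.logaddexp(0.0, X).tolist()

def t_1_loss(X):
    # -log(sigmoid(X)) == log(1 + exp(-X))
    return np.logaddexp(0.0, -X).tolist()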
Upvotes: 0
Reputation: 222352
Whenever a real number is converted to (or computed in) any limited-precision numerical format, there may be some amount of error. Suppose that, in a particular interval, the numerical format is capable of representing values with a precision of one part in P. In other words, the numbers that are representable in the format appear at distances that are about 1/P apart relative to the magnitudes of the numbers.
Then, when a real number is converted to the format, resulting in a representable number, the error (ignoring sign) may be at most ½ of 1/P (relative to the magnitude) if we choose the nearest representable number. It may be smaller, if the real number happens to be on or near a representable number.
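For example, in the IEEE-754 binary64 format that NumPy uses by default, P is about 2^52; a minimal check of the spacing:
import numpy as np

# Illustration: representable binary64 numbers near a value are spaced about
# one part in 2**52 of its magnitude apart.
eps = np.finfo(np.float64).eps      # 2**-52, about 2.22e-16: the relative spacing
print(eps)
print(np.spacing(1.0))              # gap between 1.0 and the next float, equals eps
print(np.spacing(10.0) / 10.0)      # relative gap near 10.0, also on the order of eps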
Now consider your expression f(x+k) - f(x-k). f(x+k) and f(x-k) will each have some error around ¼ of 1/P, maybe more if they are the results of several calculations, maybe less if you are lucky. But, for a simple model, we can figure the error will be somewhere in the region of 1/P. When we subtract them, the error may still be somewhere in the region of 1/P. The errors in f(x+k) and f(x-k) may reinforce or may cancel in the subtraction, so sometimes you will get very little total error, but it will often be somewhere around 1/P.
In your situation, f(x+k) and f(x-k) are very near each other. So, when they are subtracted, the result is much smaller in magnitude than they are. That error of around 1/P is relative to the magnitudes of f(x+k) and f(x-k). Since f(x+k) - f(x-k) is very small compared to f(x+k) and f(x-k), the error of 1/P relative to f(x+k) and f(x-k) is much larger relative to f(x+k) - f(x-k).
This is the source of most of the noise in your graph.
To avoid it, you need to calculate f(x+k) - f(x-k) with more precision, or you need to avoid that calculation.
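For instance, a minimal sketch of this effect at x = 15 with k = 1e-9, using the T=0 loss in its original cancellation-prone form:
import numpy as np

def t0_naive(x):
    # -log(1 - sigmoid(x)) in the original, cancellation-prone form
    return -np.log(1.0 - 1.0 / (1.0 + np.exp(-x)))

x, k = 15.0, 1e-9
a, b = t0_naive(x + k), t0_naive(x - k)
print(a, b)               # nearly identical values around 15.0000003
print(a - b)              # should be about 2e-9, but the rounding error in a and b is of a similar order
print((a - b) / (2 * k))  # noisy estimate of the true gradient sigmoid(15), about 0.9999997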
Upvotes: 1