Reputation: 22234
Why does the numeric gradient (f(x+k)-f(x-k)) / 2k of the logistic log loss function f(x) = -np.log(1.0 - __sigmoid(x)) diverge, while -np.log(__sigmoid(x)) does not? What are the potential causes and mechanisms, or am I making a mistake? The code is at the bottom.
Any suggestions, corrections, insights, references, or advice/tips/hints on how to implement the numeric gradient will be appreciated.
I am trying to implement the numeric gradient (f(x+k)-f(x-k)) / 2k of the logistic log loss function. In the figure, y is the binary true/false label T and p is the activation sigmoid(x).
When k is relatively large, such as 1e-5, the issue does not happen, at least in the plotted range of x. However, when k gets smaller, e.g. 1e-8, -np.log(1.0 - __sigmoid(x)) starts diverging, while -np.log(__sigmoid(x)) does not.
I wonder whether the subtraction in 1.0 - sigmoid(x) has something to do with it, given how floating-point numbers are stored and computed in binary; a quick check of this suspicion is sketched below.
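As a quick check of that suspicion (a small sketch, separate from the full code at the bottom), compare 1.0 - sigmoid(x) with the algebraically equivalent exp(-x) / (1 + exp(-x)), which avoids the subtraction near 1:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in (5.0, 10.0, 15.0, 20.0):
    naive = 1.0 - sigmoid(x)                    # subtraction of two nearly equal numbers
    stable = np.exp(-x) / (1.0 + np.exp(-x))    # same quantity without the cancellation
    # the relative disagreement grows with x as the cancellation destroys significant bits
    print(x, naive, stable, abs(naive - stable) / stable)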
The reason for trying to make k smaller: I prevent log(0) from becoming np.inf by adding a small number u, e.g. 1e-5, but log(x + 1e-5) causes the numerical gradient to deviate from the analytical one. To minimize that impact, I try to make it as small as possible, and that is when this issue started.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
def __sigmoid(X):
return 1 / (1 + np.exp(-1 * X))
def __logistic_log_loss(X: np.ndarray, T: np.ndarray):
    # Binary cross-entropy on the logit X with label T, where p = sigmoid(X)
    return -(T * np.log(__sigmoid(X)) + (1-T) * np.log(1-__sigmoid(X)))
def __logistic_log_loss_gradient(X, T):
    # Analytical gradient dL/dX = sigmoid(X) - T
    Z = __sigmoid(X)
    return Z-T
N = 1000
left=-20
right=20
X = np.linspace(left,right,N)
T0 = np.zeros(N)
T1 = np.ones(N)
# --------------------------------------------------------------------------------
# T = 1
# --------------------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(8,6))
ax.plot(
X,
__logistic_log_loss(X, T1),
color='blue', linestyle='solid',
label="logistic_log_loss(X, T=1)"
)
ax.plot(
X,
__logistic_log_loss_gradient(X, T1),
color='navy', linestyle='dashed',
label="Analytical gradient(T=1)"
)
# --------------------------------------------------------------------------------
# T = 0
# --------------------------------------------------------------------------------
ax.plot(
X,
__logistic_log_loss(X, T0),
color='magenta', linestyle='solid',
label="logistic_log_loss(X, T=0)"
)
ax.plot(
X,
__logistic_log_loss_gradient(X, T0),
color='purple', linestyle='dashed',
label="Analytical gradient(T=0)"
)
ax.set_xlabel("X")
ax.set_ylabel("dL/dX")
ax.set_title("Logistic log loss and gradient")
ax.legend()
ax.grid(True)
def t_0_loss(X):
return [
#logistic_log_loss(P=sigmoid(x), T=0)
-np.log(1.0 - __sigmoid(x)) for x in X
]
def t_1_loss(X):
return [
#logistic_log_loss(P=sigmoid(x), T=1)
-np.log(__sigmoid(x)) for x in X
]
N = 1000
left=-1
right=15
# Numerical gradient
# (f(x+k)-f(x-k)) / 2k
k = 1e-9
X = np.linspace(left,right,N)
fig, axes = plt.subplots(1, 2, figsize=(10,8))
# --------------------------------------------------------------------------------
# T = 0
# --------------------------------------------------------------------------------
axes[0].plot(
X,
((np.array(t_0_loss(X + k)) - np.array(t_0_loss(X - k))) / (2*k)),
color='red', linestyle='solid',
label="Diffed numerical gradient(T=0)"
)
axes[0].plot(
X[0:-1:20],
((np.array(t_0_loss(X + k)) - np.array(t_0_loss(X))) / k)[0:-1:20],
color='black', linestyle='dotted', marker='x', markersize=4,
label="Left numerical gradient(T=0)"
)
axes[0].plot(
X[0:-1:20],
((np.array(t_0_loss(X)) - np.array(t_0_loss(X - k))) / k)[0:-1:20],
color='salmon', linestyle='dotted', marker='o', markersize=5,
label="Right numerical gradient(T=0)"
)
axes[0].set_xlabel("X")
axes[0].set_ylabel("dL/dX")
axes[0].set_title("T=0: -log(1-sigmoid(x))")
axes[0].legend()
axes[0].grid(True)
# --------------------------------------------------------------------------------
# T = 1
# --------------------------------------------------------------------------------
axes[1].plot(
X,
((np.array(t_1_loss(X + k)) - np.array(t_1_loss(X - k))) / (2*k)),
color='blue', linestyle='solid',
label="Diffed numerical gradient(T=1)"
)
axes[1].plot(
X[0:-1:20],
((np.array(t_1_loss(X + k)) - np.array(t_1_loss(X))) / k)[0:-1:20],
color='cyan', linestyle='dashed', marker='x', markersize=5,
label="Left numerical gradient(T=1)"
)
axes[1].plot(
X[0:-1:20],
((np.array(t_1_loss(X)) - np.array(t_1_loss(X - k))) / k)[0:-1:20],
color='yellow', linestyle='dotted', marker='o', markersize=5,
label="Right numerical gradient(T=1)"
)
axes[1].set_xlabel("X")
axes[1].set_ylabel("dL/dX")
axes[1].set_title("T=1: -log(sigmoid(x)")
axes[1].legend()
axes[1].grid(True)
Upvotes: 2
Views: 638
Reputation: 22234
Solution by Reza.B.
Let z = 1/(1+p), p = e^(-x). Then 1 - z = p/(1+p), so log(1-z) = log(p) - log(1+p) = -x - log(1+e^(-x)), which is more stable in terms of rounding errors (we got rid of computing 1 - sigmoid(x), a subtraction of two nearly equal numbers, which is the main source of the numerical instability). Equivalently, -log(1-z) = x + log(1+e^(-x)) and -log(z) = log(1+e^(-x)), which is what the code below uses.
With this formulation, the divergence is resolved.
def t_0_loss(X):
    # -log(1 - sigmoid(X)) rewritten as X + log(1 + exp(-X))
    L = X + np.log(1 + np.exp(-X))
    return L.tolist()
def t_1_loss(X):
    # -log(sigmoid(X)) rewritten as log(1 + exp(-X))
    L = np.log(1 + np.exp(-X))
    return L.tolist()
%%timeit
((np.array(t_0_loss(X + k)) - np.array(t_0_loss(X - k))) / (2*k))
((np.array(t_1_loss(X + k)) - np.array(t_1_loss(X - k))) / (2*k))
---
599 µs ± 65.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The previous, numerically unstable version was:
47 ms ± 617 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
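As a side note (a suggestion beyond the original answer), the same reformulation can be written with np.logaddexp(a, b) = log(exp(a) + exp(b)), which also avoids overflow of np.exp(-X) when X is a large negative number; a minimal sketch:
import numpy as np

def t_0_loss(X):
    # -log(1 - sigmoid(X)) == log(1 + exp(X)), evaluated without an explicit exp
    return np.logaddexp(0.0, X).tolist()

def t_1_loss(X):
    # -log(sigmoid(X)) == log(1 + exp(-X))
    return np.logaddexp(0.0, -X).tolist()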
Upvotes: 0
Reputation: 222352
Whenever a real number is converted to (or computed in) any limited-precision numerical format, there may be some amount of error. Suppose that, in a particular interval, the numerical format is capable of representing values with a precision of one part in P. In other words, the numbers that are representable in the format appear at distances that are about 1/P apart relative to the magnitudes of the numbers.
Then, when a real number is converted to the format, resulting in a representable number, the error (ignoring sign) may be at most ½ of 1/P (relative to the magnitude) if we choose the nearest representable number. It may be smaller, if the real number happens to be on or near a representable number.
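For example, in the IEEE-754 binary64 format that NumPy uses by default, P is about 2^52; a minimal check of the spacing:
import numpy as np

# Illustration: representable binary64 numbers near a value are spaced about
# one part in 2**52 of its magnitude apart.
eps = np.finfo(np.float64).eps      # 2**-52, about 2.22e-16: the relative spacing
print(eps)
print(np.spacing(1.0))              # gap between 1.0 and the next float, equals eps
print(np.spacing(10.0) / 10.0)      # relative gap near 10.0, also on the order of eps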
Now consider your expression f(x+k) - f(x-k). f(x+k) and f(x-k) will each have some error around ¼ of 1/P, maybe more if they are the results of several calculations, maybe less if you are lucky. But, for a simple model, we can figure the error will be somewhere in the region of 1/P. When we subtract them, the error may still be somewhere in the region of 1/P. The errors in f(x+k) and f(x-k) may reinforce or may cancel in the subtraction, so sometimes you will get very little total error, but it will often be somewhere around 1/P.
In your situation, f(x+k) and f(x-k) are very near each other. So, when they are subtracted, the result is much smaller in magnitude than they are. That error of around 1/P is relative to the magnitudes of f(x+k) and f(x-k). Since f(x+k) - f(x-k) is very small compared to f(x+k) and f(x-k), the error of 1/P relative to f(x+k) and f(x-k) is much larger relative to f(x+k) - f(x-k).
This is the source of most of the noise in your graph.
To avoid it, you need to calculate f(x+k) - f(x-k) with more precision, or you need to avoid that calculation.
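For instance, a minimal sketch of this effect at x = 15 with k = 1e-9, using the T=0 loss in its original cancellation-prone form:
import numpy as np

def t0_naive(x):
    # -log(1 - sigmoid(x)) in the original, cancellation-prone form
    return -np.log(1.0 - 1.0 / (1.0 + np.exp(-x)))

x, k = 15.0, 1e-9
a, b = t0_naive(x + k), t0_naive(x - k)
print(a, b)               # nearly identical values around 15.0000003
print(a - b)              # should be about 2e-9, but the rounding error in a and b is of a similar order
print((a - b) / (2 * k))  # noisy estimate of the true gradient sigmoid(15), about 0.9999997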
Upvotes: 1