Reputation: 3806
I'm following along with TF's Getting Started guide, which does simple gradient descent on a linear model. A small tweak to it causes a problem, and I'm using that problem as a test case to learn the TF Debugger (tfdbg). Here's the code from Getting Started:
import tensorflow as tf
from tensorflow.python import debug as tf_debug
sess = tf.Session()
# sess = tf_debug.LocalCLIDebugWrapperSession(sess)
# note: the guide passes tf.float32 positionally, which lands in the
# `trainable` argument; dtype= is what's intended
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
model = W * x + b
y = tf.placeholder(tf.float32)
sq_deltas = tf.square(model - y)
loss = tf.reduce_sum(sq_deltas)
init = tf.global_variables_initializer()
sess.run(init)
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
for i in range(1000):
    sess.run(train, {x: list(range(1, 8)),
                     y: list(range(0, -7, -1))})
out = sess.run([W, b])
Now W and b have diverged:
>>> print(out)
[array([nan], dtype=float32), array([nan], dtype=float32)]
Notice that in the sess.run(train, ...) call my version feeds a dataset of 7 samples, whereas their example uses 4:
{
x : np.array([1., 2., 3., 4.]),
y : np.array([0., -1., -2., -3.])
}
The gradient descent is diverging, and if I lower the learning rate, it can solve the problem correctly again, albeit slowly:
optimizer = tf.train.GradientDescentOptimizer(0.001)
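The divergence can be reproduced outside TensorFlow entirely. Below is a minimal NumPy sketch of the same full-batch gradient descent (run_gd is my own helper; the gradients are derived by hand from loss = sum((W*x + b - y)^2), and float32 is used to mirror TF):

```python
import numpy as np

def run_gd(lr, steps=1000):
    # float32 throughout, mirroring TF: divergence overflows to inf, then nan
    x = np.arange(1, 8, dtype=np.float32)        # [1., ..., 7.]
    y = np.arange(0, -7, -1, dtype=np.float32)   # [0., -1., ..., -6.]
    W = np.float32(0.3)
    b = np.float32(-0.3)
    lr = np.float32(lr)
    for _ in range(steps):
        err = W * x + b - y
        W = W - lr * np.sum(2 * err * x)  # d/dW of sum((W*x + b - y)^2)
        b = b - lr * np.sum(2 * err)      # d/db
    return float(W), float(b)
```

With lr=0.01 every step overshoots by a growing factor, the parameters overflow float32 to inf, and the inf - inf in the next update produces the nan pair seen above; with lr=0.001 the iterates converge (slowly) toward the exact fit W = -1, b = 1.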
So I can jump into debug mode:
python mything.py --debug
> run # runs sess.run(init)
> run -f has_inf_or_nan # runs gradient descent, filters for inf/nan
# Square:0 was found to have inf
> pt Square:0 # all at or approaching inf
> ni Square # I'm not sure what to do with this
> ni -t Square # buried in the output, my code line:
# sq_deltas = tf.square(model - y)
But at this point I'm lost, and the debugger is still a bit elusive to me.
1) How do I track down the source of these infs, and, importantly, 2) what am I doing wrong that this simple linear model can't scale to a larger dataset?
Upvotes: 2
Views: 76
Reputation: 12617
1) How do I track down the source of these infs,
You won't. Barring actual bugs, divergence of optimization is what you might call "emergent behavior". It doesn't happen in a single incorrect step in the code. The parameters simply move farther and farther away from their optimal values.
2) what am I doing wrong for this simple linear model to not be able to scale to a larger dataset?
SGD generally diverges for some values of the learning rate. It's the nature of the algorithm. Which values diverge depends on the model, the initial values, and the dataset; here you are observing the dataset's effect.
In your case, the scale of the data changed: because your loss is a reduce_sum over the batch, 7 samples with larger x values produce a larger gradient than 4 samples, so a step size of 0.01 now overshoots.
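To put a number on that: for a quadratic loss such as this one, full-batch gradient descent is stable only while the learning rate stays below 2/lambda_max, where lambda_max is the largest eigenvalue of the (constant) Hessian. A sketch of that threshold for both datasets (max_stable_lr is my own helper):

```python
import numpy as np

def max_stable_lr(x):
    """Largest stable GD step for loss = sum((W*x + b - y)^2).

    The Hessian w.r.t. (W, b) is constant:
        [[2*sum(x^2), 2*sum(x)],
         [2*sum(x),   2*len(x)]]
    and GD diverges once the learning rate exceeds 2 / lambda_max.
    """
    H = 2.0 * np.array([[np.sum(x * x), np.sum(x)],
                        [np.sum(x), float(len(x))]])
    return 2.0 / np.linalg.eigvalsh(H).max()

print(max_stable_lr(np.arange(1.0, 5.0)))  # 4 samples: ~0.030, so 0.01 converges
print(max_stable_lr(np.arange(1.0, 8.0)))  # 7 samples: ~0.0069, so 0.01 diverges
```

This is why 0.001 works for the 7-sample dataset, and why switching the loss to reduce_mean (which divides the Hessian by the sample count) would also restore stability at 0.01.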
Upvotes: 1