andre_bauer

Reputation: 850

Tensorflow shuffle batch fraction unexpected behavior

I am training a convolutional neural network and I get some unexpected behavior from the shuffle_batch fraction summary, or maybe I just do not understand it. Can someone please explain it? The only difference between the two graphs is that I exchanged the loss function.

With this loss function I get the line at 0.0

loss = tf.nn.l2_loss(expected_labels-labels)

While this one gives me a constant 1.0 (after hitting 1.0 the first time)

loss = tf.reduce_mean(tf.square(expected_labels - labels))

Can the change of loss function really cause that change? I am not sure what this means.

[Plot: shuffle_batch fraction summary for both runs]

EDIT: Code as requested. The first part sets up the batching and the big picture.

# Input pipeline: read examples, decode the JPEGs and build shuffled batches.
filename_queue = tf.train.string_input_producer(filenames,
                                                num_epochs=None)
label, image = read_and_decode_single_example(filename_queue=filename_queue)
image = tf.image.decode_jpeg(image.values[0], channels=3)
jpeg = tf.cast(image, tf.float32) / 255.
jpeg.set_shape([66, 200, 3])
images_batch, labels_batch = tf.train.shuffle_batch(
    [jpeg, label], batch_size=FLAGS.batch_size,
    num_threads=8,
    capacity=60000,
    min_after_dequeue=10000)
images_placeholder, labels_placeholder = placeholder_inputs(
    FLAGS.batch_size)

label_estimations, W1_conv, h1_conv, current_images = e2e.inference(images_placeholder)

# Add to the Graph the Ops for loss calculation.
loss = e2e.loss(label_estimations, labels_placeholder)


# Decay once per epoch, using an exponential schedule starting at 0.01.


# Add to the Graph the Ops that calculate and apply gradients.
train_op = e2e.training(loss, FLAGS.learning_rate, FLAGS.batch_size)
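
The exponential decay mentioned in the comment above is not shown in the snippet; here is a minimal sketch of what it could look like with tf.train.exponential_decay (num_examples, steps_per_epoch and the 0.95 decay rate are illustrative assumptions, not part of the original setup):

# Hypothetical sketch only: decay the learning rate once per epoch,
# starting at 0.01, with an exponential schedule.
# num_examples, steps_per_epoch and decay_rate=0.95 are assumptions.
steps_per_epoch = num_examples // FLAGS.batch_size
learning_rate = tf.train.exponential_decay(
    0.01,                         # initial learning rate
    global_step,                  # incremented by the training op
    decay_steps=steps_per_epoch,  # decay once per epoch
    decay_rate=0.95,
    staircase=True)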

Here are the methods for inference, loss and training:

def inference(images):
    with tf.name_scope('conv1'):
        W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 3, FEATURE_MAPS_C1], stddev=STDDEV))
        b_conv1 = tf.Variable(tf.constant(BIAS_INIT, shape=[FEATURE_MAPS_C1]))
        h_conv1 = tf.nn.bias_add(
            tf.nn.conv2d(images, W_conv1, strides=[1, 2, 2, 1], padding='VALID'), b_conv1)

    with tf.name_scope('conv2'):
        W_conv2 = tf.Variable(tf.truncated_normal([5, 5, FEATURE_MAPS_C1, 36], stddev=STDDEV))
        b_conv2 = tf.Variable(tf.constant(BIAS_INIT, shape=[36]))
        h_conv2 = tf.nn.conv2d(h_conv1, W_conv2, strides=[1, 2, 2, 1], padding='VALID') + b_conv2

    with tf.name_scope('conv3'):
        W_conv3 = tf.Variable(tf.truncated_normal([5, 5, 36, 48], stddev=STDDEV))
        b_conv3 = tf.Variable(tf.constant(BIAS_INIT, shape=[48]))
        h_conv3 = tf.nn.conv2d(h_conv2, W_conv3, strides=[1, 2, 2, 1], padding='VALID') + b_conv3

    with tf.name_scope('conv4'):
        W_conv4 = tf.Variable(tf.truncated_normal([3, 3, 48, 64], stddev=STDDEV))
        b_conv4 = tf.Variable(tf.constant(BIAS_INIT, shape=[64]))
        h_conv4 = tf.nn.conv2d(h_conv3, W_conv4, strides=[1, 1, 1, 1], padding='VALID') + b_conv4

    with tf.name_scope('conv5'):
        W_conv5 = tf.Variable(tf.truncated_normal([3, 3, 64, 64], stddev=STDDEV))
        b_conv5 = tf.Variable(tf.constant(BIAS_INIT, shape=[64]))
        h_conv5 = tf.nn.conv2d(h_conv4, W_conv5, strides=[1, 1, 1, 1], padding='VALID') + b_conv5
        h_conv5_flat = tf.reshape(h_conv5, [-1, 1 * 18 * 64])

    with tf.name_scope('fc1'):
        W_fc1 = tf.Variable(tf.truncated_normal([1 * 18 * 64, 100], stddev=STDDEV))
        b_fc1 = tf.Variable(tf.constant(BIAS_INIT, shape=[100]))
        h_fc1 = tf.matmul(h_conv5_flat, W_fc1) + b_fc1

    with tf.name_scope('fc2'):
        W_fc2 = tf.Variable(tf.truncated_normal([100, 50], stddev=STDDEV))
        b_fc2 = tf.Variable(tf.constant(BIAS_INIT, shape=[50]))
        h_fc2 = tf.matmul(h_fc1, W_fc2) + b_fc2

    with tf.name_scope('fc3'):
        W_fc3 = tf.Variable(tf.truncated_normal([50, 10], stddev=STDDEV))
        b_fc3 = tf.Variable(tf.constant(BIAS_INIT, shape=[10]))
        h_fc3 = tf.matmul(h_fc2, W_fc3) + b_fc3

    with tf.name_scope('fc4'):
        W_fc4 = tf.Variable(tf.truncated_normal([10, 1], stddev=STDDEV))
        b_fc4 = tf.Variable(tf.constant(BIAS_INIT, shape=[1]))
        h_fc4 = tf.matmul(h_fc3, W_fc4) + b_fc4

    return h_fc4

Here is the loss function; using l2_loss causes the issue.

def loss(label_estimations, labels):    
    n_labels = tf.reshape(label_estimations, [-1])
    # Here are the two loss functions
    #loss = tf.reduce_mean(tf.square(n_labels - labels))
    loss = tf.nn.l2_loss(n_labels-labels)
    return loss

Train method:

def training(loss, learning_rate, batch_size): 
    global_step = tf.Variable(0, name='global_step', trainable=False)
    tf.scalar_summary('learning_rate', learning_rate)
    tf.scalar_summary('Loss ('+loss.op.name+')', loss)

    optimizer = tf.train.AdamOptimizer(learning_rate)
    train_op = optimizer.minimize(loss, global_step=global_step)
    return train_op

Plot for tf.reduce_sum(tf.square(n_labels - labels)/2)


Upvotes: 1

Views: 786

Answers (2)

Jeffion

Reputation: 91

As mentioned in TensorFlow's original guide https://www.tensorflow.org/programmers_guide/reading_data

How many threads do you need? The tf.train.shuffle_batch* functions add a summary to the graph that indicates how full the example queue is. If you have enough reading threads, that summary will stay above zero. You can view your summaries as training progresses using TensorBoard.

Training goes better if the queue never runs empty, i.e. the "fraction_full" summary stays non-zero. If it keeps dropping to zero, you should allocate more reading threads to the queue runner.
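
For example, a minimal sketch based on the question's queue setup (the thread count is illustrative, not prescriptive):

# Sketch: give the shuffle queue more reader threads so it can keep up
# with dequeues; the exact numbers are illustrative.
images_batch, labels_batch = tf.train.shuffle_batch(
    [jpeg, label],
    batch_size=FLAGS.batch_size,
    num_threads=16,              # was 8; more threads refill the queue faster
    capacity=60000,
    min_after_dequeue=10000)

# shuffle_batch itself adds a "fraction_of_..._full" summary to the graph;
# watch it in TensorBoard and raise num_threads if it keeps dropping to zero.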

Upvotes: 1

lejlot

Reputation: 66805

The only difference between your loss and l2 is scaling; thus you might need to play around with your learning rate / other hyperparameters to take this into account.

l2 loss in TF is defined as:

1/2 * SUM_{i=1}^N (pred(x_i) - y_i)^2

while your cost is

1/N * SUM_{i=1}^N (pred(x_i) - y_i)^2

Of course, since you are using a stochastic gradient approach, you are effectively using an approximation of the form

1/2 * SUM_{(x_i, y_i) in batch} (pred(x_i) - y_i)^2       # l2
1/#batch * SUM_{(x_i, y_i) in batch} (pred(x_i) - y_i)^2  # yours

Thus you would have to multiply your cost by batch_size / 2 to recover the original cost. Typically this is not a problem, but sometimes wrong scaling can put you in very degenerate parts of the error surface, and the optimizer will simply fail (especially an aggressive one like Adam).
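
A quick numerical check of this scaling relation (a sketch using the TF 1.x session API; the residual values are made up):

import tensorflow as tf

# Pretend these are the residuals (pred - label) for a batch of 4 examples.
diff = tf.constant([1.0, 2.0, 3.0, 4.0])
batch_size = 4

l2 = tf.nn.l2_loss(diff)               # 1/2 * sum(diff^2)  -> 15.0
mse = tf.reduce_mean(tf.square(diff))  # 1/N * sum(diff^2)  ->  7.5

with tf.Session() as sess:
    l2_val, mse_val = sess.run([l2, mse])
    # l2 == mse * batch_size / 2
    print(l2_val, mse_val * batch_size / 2)  # 15.0 15.0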

Side note: are you aware that your model is a deep linear model? You do not have any non-linearities in it, so this is a very specific kind of network.
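
For instance, a sketch of one way to add a non-linearity, based on the question's conv1 block (the choice of ReLU is illustrative):

with tf.name_scope('conv1'):
    W_conv1 = tf.Variable(tf.truncated_normal([5, 5, 3, FEATURE_MAPS_C1], stddev=STDDEV))
    b_conv1 = tf.Variable(tf.constant(BIAS_INIT, shape=[FEATURE_MAPS_C1]))
    # Wrapping the affine output in a ReLU makes the layer non-linear;
    # the same change would apply to the other conv and fc layers.
    h_conv1 = tf.nn.relu(tf.nn.bias_add(
        tf.nn.conv2d(images, W_conv1, strides=[1, 2, 2, 1], padding='VALID'), b_conv1))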

Upvotes: 0
