Reputation: 41
I probably have a "bloated graph" (see Why does tf.assign() slow the execution time?), since each epoch takes more and more time, but I can't see where this happens in my code. Can you please help me? I'm still a TensorFlow newbie.
# NEURAL NETWORK
def MLP(x, weights, biases, is_training):
    # Hidden layer 1
    hLayer1 = tf.add(tf.matmul(x, weights["w1"]), biases["b1"])
    hLayer1 = tf.nn.sigmoid(hLayer1)
    bn1 = batch_norm_wrapper(hLayer1, gamma=weights["gamma1"], beta=weights["beta1"], is_training=is_training, name="1")
    hLayer1 = bn1

    # Hidden layer 2
    hLayer2 = tf.add(tf.matmul(hLayer1, weights["w2"]), biases["b2"])
    hLayer2 = tf.nn.sigmoid(hLayer2)
    bn2 = batch_norm_wrapper(hLayer2, gamma=weights["gamma2"], beta=weights["beta2"], is_training=is_training, name="2")
    hLayer2 = bn2

    # Output layer
    outLayer = tf.add(tf.matmul(hLayer2, weights["wOut"]), biases["bOut"], name="outLayer")
    return outLayer
# Weights and biases
weights = {
    "w1": tf.get_variable(shape=[n_input, n_hLayer1], initializer=tf.keras.initializers.he_normal(seed=5), name="w1", trainable=True),
    "w2": tf.get_variable(shape=[n_hLayer1, n_hLayer2], initializer=tf.keras.initializers.he_normal(seed=5), name="w2", trainable=True),
    "wOut": tf.get_variable(shape=[n_hLayer2, n_classes], initializer=tf.keras.initializers.he_normal(seed=5), name="wOut", trainable=True),
    "gamma1": tf.get_variable(shape=[n_hLayer1], initializer=tf.ones_initializer(), name="gamma1", trainable=True),
    "beta1": tf.get_variable(shape=[n_hLayer1], initializer=tf.zeros_initializer(), name="beta1", trainable=True),
    "gamma2": tf.get_variable(shape=[n_hLayer2], initializer=tf.ones_initializer(), name="gamma2", trainable=True),
    "beta2": tf.get_variable(shape=[n_hLayer2], initializer=tf.zeros_initializer(), name="beta2", trainable=True)
}

biases = {
    "b1": tf.get_variable(shape=[n_hLayer1], initializer=tf.zeros_initializer(), name="b1", trainable=True),
    "b2": tf.get_variable(shape=[n_hLayer2], initializer=tf.zeros_initializer(), name="b2", trainable=True),
    "bOut": tf.get_variable(shape=[n_classes], initializer=tf.zeros_initializer(), name="bOut", trainable=True)
}
def batch_norm_wrapper(inputs, gamma, beta, is_training, name, decay=0.999):
    pop_mean = tf.Variable(tf.zeros([inputs.get_shape()[-1]]), name="pop_mean{}".format(name), trainable=False)
    pop_var = tf.Variable(tf.zeros([inputs.get_shape()[-1]]), name="pop_var{}".format(name), trainable=False)

    if is_training:
        batch_mean, batch_var = tf.nn.moments(inputs, [0])
        train_mean = tf.assign(pop_mean, pop_mean*decay + batch_mean*(1-decay))
        train_var = tf.assign(pop_var, pop_var*decay + batch_var*(1-decay))
        with tf.control_dependencies([train_mean, train_var]):
            return tf.nn.batch_normalization(x=inputs, mean=batch_mean, variance=batch_var, scale=gamma, offset=beta, variance_epsilon=0.001)
    else:
        return tf.nn.batch_normalization(x=inputs, mean=pop_mean, variance=pop_var, scale=gamma, offset=beta, variance_epsilon=0.001)
# Model
predictions = MLP(next_element[0], weights, biases, is_training=True)
# Loss function and regularization
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(logits=predictions, labels=next_element[1]))
l1_regularizer = tf.reduce_sum(tf.abs(weights["w1"])) + tf.reduce_sum(tf.abs(weights["w2"])) + tf.reduce_sum(tf.abs(weights["wOut"]))
l2_regularizer = tf.reduce_mean(tf.nn.l2_loss(weights["w1"]) + tf.nn.l2_loss(weights["w2"]) + tf.nn.l2_loss(weights["wOut"]))
loss = loss + r*alpha1*l1_regularizer + (1-r)*alpha2*l2_regularizer
# Optimizer
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
# LAUNCH THE GRAPH
with tf.Session() as sess:
    sess.run(init_op)

    # Training
    for trainEpoch in range(training_epochs):
        sess.run(training_iterator_op)

        while True:
            try:
                value = sess.run(next_element)
                sess.run([loss, optimizer])
            except tf.errors.OutOfRangeError:
                break
I use the Dataset API to run through my training data.
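For what it's worth, here is a minimal diagnostic sketch (not part of my training code) of how one could check whether the graph is actually growing between epochs: count the operations in the default graph, or finalize the graph so that any accidental op creation raises an error.

# Diagnostic sketch: does the default graph keep growing?
graph = tf.get_default_graph()
print("ops before training:", len(graph.get_operations()))
# ... run one epoch ...
print("ops after one epoch:", len(graph.get_operations()))  # a growing count means new ops are created inside the loop

# Alternatively, freeze the graph before the training loop; any later attempt
# to add an op raises a RuntimeError that points at the culprit.
graph.finalize()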
Upvotes: 4
Views: 3163
Reputation: 228
I thought I'd share my findings on training issues related to your question in TensorFlow 2.x.x (in my case 2.4.1). Here is what I found after hours and hours of research on the internet.
Upvotes: 2
Reputation: 16587
There are a few reasons that can cause a slowdown during training:
This is most likely due to your training loop holding on to something it shouldn't. Also make sure that you are not storing temporary computations in an ever-growing list without deleting them (illustrated in the sketch below).
If you are using a custom network/loss function, it is also possible that the computation gets more expensive as you get closer to the optimal solution. To track this down, you could time the different parts separately: data loading, network forward pass, loss computation, backward pass, and parameter update. Hopefully just one of them will increase, and you will be able to see better what is going on (a rough timing sketch follows the links below).
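To illustrate the "ever-growing list" point, here is a hedged sketch with hypothetical names (train_step is a placeholder, not code from either linked thread):

# Anti-pattern: appending the framework object itself keeps its whole
# computation history (and memory) alive, so each iteration gets slower.
history = []
for batch in batches:
    loss_value = train_step(batch)      # hypothetical helper returning a tensor
    history.append(loss_value)          # holds on to the tensor and everything attached to it

# Better: keep only a plain Python number, so nothing extra is retained.
history = []
for batch in batches:
    loss_value = train_step(batch)
    history.append(float(loss_value))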
Reference: Training gets slow down by each batch slowly [in Pytorch]
Read also: training slow down [in Keras]
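And a rough sketch of the per-part timing suggested above (the timed helper and the phase functions are placeholders, not taken from any of the threads):

import time

def timed(label, fn, *args):
    # Hypothetical helper: run one phase and report how long it took.
    start = time.perf_counter()
    result = fn(*args)
    print("{}: {:.4f}s".format(label, time.perf_counter() - start))
    return result

# Inside the training loop (replace the names with your own functions):
# batch = timed("data loading", load_next_batch)
# loss  = timed("forward + loss", compute_loss, batch)
# grads = timed("backward", compute_gradients, loss)
# _     = timed("parameter update", apply_gradients, grads)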
Upvotes: 1