user288609

Reputation: 13025

mini-batch gradient descent implementation in tensorflow

While reading a TensorFlow implementation of a deep learning model, I am trying to understand the following code segment from the training process.

self.net.gradients_node = tf.gradients(loss, self.variables)

avg_gradients = None  # initialized before the epoch loop in the full training script
for epoch in range(epochs):
    total_loss = 0
    for step in range((epoch * training_iters), ((epoch + 1) * training_iters)):
        batch_x, batch_y = data_provider(self.batch_size)

        # Run optimization op (backprop)
        _, loss, lr, gradients = sess.run((self.optimizer, self.net.cost,
                                           self.learning_rate_node, self.net.gradients_node),
                                          feed_dict={self.net.x: batch_x,
                                                     self.net.y: util.crop_to_shape(batch_y, pred_shape),
                                                     self.net.keep_prob: dropout})

        # Running (incremental) mean of the gradients over all steps so far
        if avg_gradients is None:
            avg_gradients = [np.zeros_like(gradient) for gradient in gradients]
        for i in range(len(gradients)):
            avg_gradients[i] = (avg_gradients[i] * (1.0 - (1.0 / (step + 1)))) + (gradients[i] / (step + 1))

        # Norm of each averaged gradient, assigned to a TF variable
        norm_gradients = [np.linalg.norm(gradient) for gradient in avg_gradients]
        self.norm_gradients_node.assign(norm_gradients).eval()

        total_loss += loss

I think it is related to mini-batch gradient descent, but I cannot understand how it works, and I have difficulty connecting it to the algorithm shown below.

[image: the mini-batch gradient descent algorithm referred to above]

Upvotes: 1

Views: 3811

Answers (1)

Ishamael

Reputation: 12795

This is not related to mini-batch SGD.

The code computes a running average of the gradients over all training steps. After the first step, avg_gradients contains the gradient that was just computed; after the second step it is the elementwise mean of the two gradients from the first two steps; after n steps it is the elementwise mean of all n gradients computed so far. The code then takes the norm of each averaged gradient and assigns those norms to norm_gradients_node. It is hard to tell why these averaged gradients are needed without the context in which they were presented.
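
To see that the update rule in the loop really produces the elementwise mean, here is a small standalone NumPy sketch; the list gradients_per_step is a made-up stand-in for the per-step gradients returned by sess.run in the question:

    import numpy as np

    # Toy per-step "gradients" (stand-ins for the real ones from sess.run)
    gradients_per_step = [np.array([1.0, 2.0]),
                          np.array([3.0, 4.0]),
                          np.array([5.0, 6.0])]

    avg_gradient = None
    for step, g in enumerate(gradients_per_step):
        if avg_gradient is None:
            avg_gradient = np.zeros_like(g)
        # Same update rule as in the question:
        # avg <- avg * (1 - 1/(step+1)) + g / (step+1)
        avg_gradient = avg_gradient * (1.0 - 1.0 / (step + 1)) + g / (step + 1)

    # After n steps this equals the elementwise mean of all n gradients
    print(avg_gradient)                          # [3. 4.]
    print(np.mean(gradients_per_step, axis=0))   # [3. 4.]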

Upvotes: 0
