Thermodynamix

Reputation: 367

TensorFlow averaging minibatch gradients in parallel

I want to train a neural network using batch gradient descent, but I'd like to parallelize the process: split the batch into mini-batches, distribute the gradient computation across processes, then bring the gradients back to the master process, average them, and apply the averaged gradient to update the model.

As a simple example, take this script that trains a neural net on N data points for the parabola y = x^2:

import tensorflow as tf
import numpy as np

def add_layer(inputs, in_size, out_size, activation_function=None):
    Weights = tf.Variable(tf.random_normal([in_size, out_size]))
    biases = tf.Variable(tf.random_normal([1, out_size]))
    Wx_plus_b = tf.matmul(inputs, Weights) + biases
    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    return outputs

# Make up some real data
N = 50
x_data = np.linspace(-2, 2, N)[:, np.newaxis]
noise = np.random.normal(0, 0.05, x_data.shape)
y_data = np.square(x_data) # - 0.5 + noise

# Define placeholder for x_data and y_data
xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])

""" Build the network"""
# Add hidden layer
l1 = add_layer(xs, 1, 5, activation_function=tf.tanh)
# Add output layer
prediction = add_layer(l1, 5, 1, activation_function=None)

# Define loss
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction), axis=[1]))

# Define optimizer
opt = tf.train.GradientDescentOptimizer(learning_rate=1e-2)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss)
# Ask the optimizer to apply the gradients
train_opt = opt.apply_gradients(grads_and_vars)

# Initialize global variables
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)

for i in range(2000):
    # training
    sess.run(train_opt, feed_dict={xs: x_data, ys: y_data})
    if i % 50 == 0:
        prediction_value = sess.run(prediction, feed_dict={xs: x_data})  # current fit (unused here; e.g. for plotting)
        print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))

The part I want to parallelize is the computation of the gradients; the resulting gradients should then be brought back to the master process, averaged, and applied in the training step. I want to split the N data points in x_data over P processes.
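To make it concrete, here is a rough single-process sketch of the update I have in mind (no multiprocessing yet; P, the chunking with np.array_split, and the gradient placeholders are just my guess at how the averaging could be wired up). It reuses opt, grads_and_vars, xs, ys, sess, x_data and y_data from the script above and replaces the training loop:

P = 5  # number of chunks, i.e. would-be worker processes

# Placeholders through which the master feeds back the averaged gradients
grad_placeholders = [(tf.placeholder(tf.float32, shape=v.get_shape()), v)
                     for _, v in grads_and_vars]
apply_avg_grads = opt.apply_gradients(grad_placeholders)

x_chunks = np.array_split(x_data, P)
y_chunks = np.array_split(y_data, P)

for step in range(2000):
    # Each of these sess.run calls is what one worker would compute
    chunk_grads = [sess.run([g for g, _ in grads_and_vars],
                            feed_dict={xs: xc, ys: yc})
                   for xc, yc in zip(x_chunks, y_chunks)]
    # Master: average the per-chunk gradients, variable by variable
    avg_grads = [np.mean([cg[i] for cg in chunk_grads], axis=0)
                 for i in range(len(grads_and_vars))]
    # Apply the averaged gradients once per step
    sess.run(apply_avg_grads,
             feed_dict={ph: g for (ph, _), g in zip(grad_placeholders, avg_grads)})

With equal-size chunks and the tf.reduce_mean loss above, averaging the per-chunk gradients should reproduce the full-batch gradient, so this is numerically the same update as before, just split into pieces.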

I think this is so-called "synchronous training", which I've seen resources mention, but no one ever explains how to actually implement it.

How can I parallelize this simple example in a synchronous manner?

Upvotes: 2

Views: 1626

Answers (1)

toto2

Reputation: 5326

You probably won't find much on synchronous training because it was mostly abandoned in favor of asynchronous training.

In synchronous gradient descent, all the mini-batches have to finish, and their respective gradients are all applied at once to update the network parameters. In the asynchronous case, the network parameters are updated every time the gradient from one mini-batch becomes available, so the updates arrive in more or less random order. At first glance this seems invalid: say the network parameters have been updated 1342 times and you start computing the gradient for some mini-batch. By the time that computation finishes, the parameters might have been updated 1349 times, because 7 other mini-batches reported their gradients in the meantime. You would then be applying a gradient correction to parameters that are no longer the ones the gradient was computed against.

From what I wrote above it might seem that asynchronous descent is wrong, but you have to understand that stochastic gradient descent is already a sloppy, inexact process, and adding the extra sloppiness from asynchronous updates is not detrimental. Synchronous updates, on the other hand, frequently leave some GPUs sitting idle because they have to wait for all the other GPUs to finish their mini-batch.
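For reference, the usual single-machine synchronous pattern is "in-graph" replication over GPU towers, roughly like this (a sketch in the style of the multi-GPU CIFAR-10 tutorial; num_gpus, the data shards and the build_loss helper are placeholders I made up, not part of your script):

def average_gradients(tower_grads):
    # tower_grads: one [(grad, var), ...] list per tower, all in the same variable order
    averaged = []
    for grads_and_var in zip(*tower_grads):
        grads = [g for g, _ in grads_and_var]
        var = grads_and_var[0][1]
        averaged.append((tf.reduce_mean(tf.stack(grads), axis=0), var))
    return averaged

tower_grads = []
for i in range(num_gpus):
    with tf.device('/gpu:%d' % i):
        # build_loss must reuse the same variables on every tower
        # (e.g. via tf.variable_scope(..., reuse=...))
        loss_i = build_loss(x_shards[i], y_shards[i])
        tower_grads.append(opt.compute_gradients(loss_i))

train_opt = opt.apply_gradients(average_gradients(tower_grads))

Every tower has to finish its gradients before the single apply_gradients op can run, which is exactly the synchronous behaviour (and the idle-GPU cost) described above.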

I quickly tried to find an appropriate reference about this on the web but could not. I remember that the trick of using asynchronous updates was rediscovered many times by different groups. There is this old paper from Jeff Dean, but they don't analyze synchronous vs asynchronous.

The official TensorFlow documentation has an example with asynchronous training, but there might be better tutorials out there.

The web page I linked above also points to this synchronous training example.
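If you go the distributed route (several worker processes plus a parameter server), the TensorFlow piece that does this aggregation for you is tf.train.SyncReplicasOptimizer. A rough sketch, leaving out the cluster/server boilerplate and assuming P, is_chief and global_step are set up elsewhere:

opt = tf.train.GradientDescentOptimizer(learning_rate=1e-2)
sync_opt = tf.train.SyncReplicasOptimizer(opt,
                                          replicas_to_aggregate=P,
                                          total_num_replicas=P)
train_opt = sync_opt.minimize(loss, global_step=global_step)
hook = sync_opt.make_session_run_hook(is_chief)
# run the training loop inside tf.train.MonitoredTrainingSession(hooks=[hook], ...)

It accumulates the gradients from P workers, averages them, applies one update, and only then lets the workers start on the next step.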

Upvotes: 3
