mac wang

Reputation: 51

For deep learning, with ReLU activation the output becomes NaN during training, while it is normal with tanh

The neural network I'm training is the critic network for deep reinforcement learning. The problem is that when one of the layers' activation is set to ReLU or ELU, the output becomes NaN after some training steps, while the output stays normal when the activation is tanh. The code is as follows (based on TensorFlow):

with tf.variable_scope('critic'):
    self.batch_size = tf.shape(self.tfs)[0]
    l_out_x = denseWN(x=self.tfs, name='l3', num_units=self.cell_size, nonlinearity=tf.nn.tanh, trainable=True, shape=[det*step*2, self.cell_size])
    l_out_x1 = denseWN(x=l_out_x, name='l3_1', num_units=32, nonlinearity=tf.nn.tanh, trainable=True, shape=[self.cell_size, 32])
    l_out_x2 = denseWN(x=l_out_x1, name='l3_2', num_units=32, nonlinearity=tf.nn.tanh, trainable=True, shape=[32, 32])
    l_out_x3 = denseWN(x=l_out_x2, name='l3_3', num_units=32, trainable=True, shape=[32, 32])
    self.v = denseWN(x=l_out_x3, name='l4', num_units=1, trainable=True, shape=[32, 1])

Here is the code for basic layer construction:

def get_var_maybe_avg(var_name, ema, trainable, shape):
    if var_name == 'V':
        initializer = tf.contrib.layers.xavier_initializer()
        v = tf.get_variable(name=var_name, initializer=initializer, trainable=trainable, shape=shape)
    elif var_name == 'g':
        initializer = tf.constant_initializer(1.0)
        v = tf.get_variable(name=var_name, initializer=initializer, trainable=trainable, shape=[shape[-1]])
    elif var_name == 'b':
        initializer = tf.constant_initializer(0.1)
        v = tf.get_variable(name=var_name, initializer=initializer, trainable=trainable, shape=[shape[-1]])
    if ema is not None:
        v = ema.average(v)
    return v

def get_vars_maybe_avg(var_names, ema, trainable, shape):
    vars = []
    for vn in var_names:
        vars.append(get_var_maybe_avg(vn, ema, trainable=trainable, shape=shape))
    return vars

def denseWN(x, name, num_units, trainable, shape, nonlinearity=None, ema=None, **kwargs):
    with tf.variable_scope(name):
        V, g, b = get_vars_maybe_avg(['V', 'g', 'b'], ema, trainable=trainable, shape=shape)
        x = tf.matmul(x, V)
        scaler = g / tf.sqrt(tf.reduce_sum(tf.square(V), [0]))
        x = tf.reshape(scaler, [1, num_units]) * x + tf.reshape(b, [1, num_units])
        if nonlinearity is not None:
            x = nonlinearity(x)
        return x

Here is the code to train the network:

self.tfdc_r = tf.placeholder(tf.float32, [None, 1], 'discounted_r')
self.advantage = self.tfdc_r - self.v
l1_regularizer = tf.contrib.layers.l1_regularizer(scale=0.005, scope=None)
self.weights = tf.trainable_variables()
regularization_penalty_critic = tf.contrib.layers.apply_regularization(l1_regularizer, self.weights)
self.closs = tf.reduce_mean(tf.square(self.advantage))
self.optimizer = tf.train.RMSPropOptimizer(0.0001, 0.99, 0.0, 1e-6)
self.grads_and_vars = self.optimizer.compute_gradients(self.closs)
self.grads_and_vars = [[tf.clip_by_norm(grad, 5), var] for grad, var in self.grads_and_vars if grad is not None]
self.ctrain_op = self.optimizer.apply_gradients(self.grads_and_vars, global_step=tf.contrib.framework.get_global_step())

Upvotes: 3

Views: 1333

Answers (1)

Maxim

Reputation: 53768

Looks like you're facing the problem of exploding gradients with the ReLU activation function (that's what the NaNs indicate: very large activations). There are several techniques to deal with this issue, e.g. batch normalization (which changes the network architecture) or careful variable initialization (that's what I'd try first).
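
Before changing anything, you can confirm that it really is the activations blowing up: tf.check_numerics returns its input unchanged but makes the session raise as soon as the tensor contains Inf/NaN. A rough sketch, reusing the l_out_x* and self.v tensors from your critic scope:

# Assumption: placed right after the layers are built in the 'critic' scope.
l_out_x = tf.check_numerics(l_out_x, 'l3 output is Inf/NaN')
l_out_x3 = tf.check_numerics(l_out_x3, 'l3_3 output is Inf/NaN')
self.v = tf.check_numerics(self.v, 'critic value output is Inf/NaN')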

You are using Xavier initialization for the V variables in the different layers, which indeed works fine for the logistic sigmoid activation (see the paper by Xavier Glorot and Yoshua Bengio) and, likewise, for tanh.

The preferred initialization strategy for the ReLU activation function (and its variants, including ELU) is He initialization. In TensorFlow it's available via tf.variance_scaling_initializer:

initializer = tf.variance_scaling_initializer(scale=2.0)  # scale=2.0 gives He initialization
v = tf.get_variable(name=var_name, initializer=initializer, ...)
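
Plugged into the get_var_maybe_avg helper from your question, only the branch that creates the weight matrix V needs to change; a rough sketch:

# Inside get_var_maybe_avg: He initialization for the weight matrix V.
# scale=2.0 with the default mode='fan_in' gives variance 2/fan_in, suited to ReLU/ELU.
if var_name == 'V':
    initializer = tf.variance_scaling_initializer(scale=2.0)
    v = tf.get_variable(name=var_name, initializer=initializer, trainable=trainable, shape=shape)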

You might also want to try smaller values for the b and g variables, but it's hard to suggest exact values just by looking at your model. If nothing helps, consider adding batch-norm layers to your model to control the distribution of activations.
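
If you go the batch-norm route, a minimal sketch inside denseWN could look like this (is_training is an assumed placeholder you would feed yourself, and the control dependency on the update ops is needed so the moving statistics get updated during training):

# Inside denseWN: normalize the pre-activation before applying the nonlinearity.
x = tf.layers.batch_normalization(x, training=is_training)
if nonlinearity is not None:
    x = nonlinearity(x)

# When building the train op, run the batch-norm updates alongside it:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    self.ctrain_op = self.optimizer.apply_gradients(self.grads_and_vars)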

Upvotes: 0
