Reputation: 6538
In my recurrent model (a sequential binary classifier), at each time step t I need to perform the following input transformation:
[32 x 4] --> [32 x 100]
So, if my sequence length is 3, I should have:
[32 x 4] --> [32 x 100]
[32 x 4] --> [32 x 100]
[32 x 4] --> [32 x 100]
I do it by applying the linear transform xW + b to the [32 x 4] tensor at each time step t. My working Torch implementation of the model shows the mean of the linear weights changing each epoch:
Epoch #1
0.0012639100896195
0.0012639100896195
0.0012639100896195
Epoch #2
0.0039414558559656
0.0039414558559656
0.0039414558559656
Epoch #3
-0.0099147083237767
-0.0099147083237767
-0.0099147083237767
The backward pass updates the weights and everything works. However, when I attempt to do the same in TensorFlow, the mean stays the same or changes only very slightly at each epoch:
Epoch: 1
> lr update: 0.0497500005
#################### DEBUGGING ####################
0.051794354 Model/input_layer2/linear_weigth:0
0.06118914 Model/input_layer2_bias/linear_bias:0
Epoch: 2
> lr update: 0.049500001
#################### DEBUGGING ####################
0.051794227 Model/input_layer2/linear_weigth:0
0.06118797 Model/input_layer2_bias/linear_bias:0
Epoch: 3
> lr update: 0.0492500015
#################### DEBUGGING ####################
0.051794235 Model/input_layer2/linear_weigth:0
0.06118701 Model/input_layer2_bias/linear_bias:0
The TensorFlow linear implementation is very simple:
def linear(input):
    return tf.add(tf.matmul(input, self.linear_weight), self.linear_bias)

expanded = [linear(batch_seq) for batch_seq in unstacked_input]
Both self.linear_weight and self.linear_bias are trainable and are initialized as tf.Variables during graph construction. The Torch and TF models use identical training datasets and hyperparameters, and their sizes (number of parameters) are the same as well. Needless to say, the Torch model trains and shows good results on test data, while the TF model does not train at all.
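The weight-sharing setup described above can be sketched in plain Python, without either framework. This is only an illustration under stated assumptions: make_linear and mean_of are hypothetical helpers, the input values are dummies, and the uniform initialization range is arbitrary. It shows one weight matrix and one bias reused at every time step, which is one plausible reason a per-step report of the weight mean prints the same number three times.

```python
import random

def make_linear(in_dim, out_dim, seed=0):
    """Create ONE weight matrix and ONE bias vector (hypothetical helper)."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)]
              for _ in range(in_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def linear(batch, weight, bias):
    """Compute xW + b for a batch given as a list of rows."""
    out_dim = len(bias)
    return [
        [sum(x * weight[i][j] for i, x in enumerate(row)) + bias[j]
         for j in range(out_dim)]
        for row in batch
    ]

def mean_of(matrix):
    """Mean over all entries of a list-of-lists matrix."""
    values = [v for row in matrix for v in row]
    return sum(values) / len(values)

# Three time steps, each a [32 x 4] batch of dummy inputs.
seq_len, batch_size, in_dim, out_dim = 3, 32, 4, 100
sequence = [[[1.0] * in_dim for _ in range(batch_size)]
            for _ in range(seq_len)]

# The same parameters are applied at every step: [32 x 4] --> [32 x 100].
weight, bias = make_linear(in_dim, out_dim)
expanded = [linear(step, weight, bias) for step in sequence]

# Reporting the mean once per step yields three identical values.
weight_means = [mean_of(weight) for _ in sequence]
```

With shared parameters there are only two trainable tensors to inspect (the weight and the bias), regardless of the sequence length.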
Since I am new to TF, could you give some tips on what could be wrong with the TF model? I understand that this is a very long shot without the complete code, but maybe I am missing something TF-specific here.
You might have noticed that in Torch we have 3 mean values, one per linear operation at time step t, while in TF I get 2 means: one comes from the linear weight and the other from the bias. If instead of linear() I use a tf.layers.dense call without the name parameter, I actually get 3 mean values, one per dense call. But in that case TF creates a separate variable, and hence a different mean value, for each dense call, which is not what we want.
Here is the training chunk of TF code which should do all the forward/backward magic but it does not:
if self.training:
    self.lr = tf.Variable(0.0, trainable=False)
    tvars = tf.trainable_variables()
    # clip the gradient by norm
    grads, _ = tf.clip_by_global_norm(tf.gradients(self.cost, tvars), config.grad_clip)
    # update variables (weights, biases, embeddings...)
    with tf.name_scope("optimizer"):
        optimizer = tf.train.AdamOptimizer(self.lr)
        # compute grads/vars for tensorboard
        self.grads_and_vars = optimizer.compute_gradients(loss)
        # debugging only, this is how I get the weights and grads
        for g, v in self.grads_and_vars:
            self.param_vals[v.name] = v
            self.param_grads[v.name+'_grads'] = g
        self.train_op = optimizer.apply_gradients(zip(grads, tvars),
                                                  global_step=tf.train.get_or_create_global_step())
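For readers unfamiliar with the clipping step in the code above, here is a minimal pure-Python sketch of the rule that tf.clip_by_global_norm documents: every gradient is multiplied by clip_norm / max(global_norm, clip_norm). The function below is a re-implementation for illustration only, not TensorFlow's code, and the gradient values are made up.

```python
import math

def clip_by_global_norm(grads, clip_norm):
    """Rescale a list of gradient vectors so their joint L2 norm is at most clip_norm.

    Illustrative re-implementation of the documented rule, not TensorFlow code.
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # When global_norm <= clip_norm the scale is exactly 1.0 (no change).
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm

# Made-up gradients for two variables; joint norm = sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)

# Gradients already below the threshold pass through untouched.
small, small_norm = clip_by_global_norm([[0.3, 0.4]], clip_norm=5.0)
```

Clipping only ever shrinks gradients. If the gradients are already close to zero, they pass through unchanged, so clipping by itself would not make weights stall.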
These are the TensorBoard screenshots taken after the model stopped training at 38 epochs because the validation loss no longer decreased. I am also not very familiar with TB, and I can only tell from the histograms that something is definitely not right.
# collecting data for tb
tf.summary.scalar("Training loss", model_train.cost)
tf.summary.scalar("Learning rate", model_train.lr)
tf.summary.histogram("Training loss", model_train.cost)
for g, v in model_train.grads_and_vars:
    tf.summary.histogram(v.name, v)
    tf.summary.histogram(v.name + '_grad', g)
Upvotes: 2
Views: 407
Reputation: 6538
It appears that I was applying loss = tf.sigmoid(logits) (as in the original Torch model) and then feeding loss to tf.losses.sigmoid_cross_entropy, which expects raw logits and applies the sigmoid itself. This brought the gradients to nearly zero, so the weights were not updated properly. When I removed the tf.sigmoid call, the gradients increased and the weights started moving.
logits = tf.nn.xw_plus_b(last_layer, self.output_w, self.output_b)
floss = tf.losses.sigmoid_cross_entropy
#floss = tf.nn.sigmoid_cross_entropy_with_logits
loss = floss(self.targets_input, logits, weights=1.0, label_smoothing=0,
             scope="sigmoid_cross_entropy", loss_collection=tf.GraphKeys.LOSSES)
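The effect of the extra sigmoid can be checked by hand. The sketch below is plain Python rather than TensorFlow, and its function names are made up for illustration. It compares the gradient of the cross-entropy with respect to the logit z in two cases: the loss receives z directly, or it receives sigmoid(z) and applies a second sigmoid internally. In the second case the chain rule adds a factor p(1 - p), with p = sigmoid(z), which is at most 0.25 and vanishes when the first sigmoid saturates.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_correct(z, y):
    """dL/dz when the loss is given the raw logit z: sigmoid(z) - y."""
    return sigmoid(z) - y

def grad_double_sigmoid(z, y):
    """dL/dz when the loss is given p = sigmoid(z) and applies sigmoid again.

    Chain rule: (sigmoid(p) - y) * dp/dz, with dp/dz = p * (1 - p).
    """
    p = sigmoid(z)
    return (sigmoid(p) - y) * p * (1.0 - p)

# Compare both gradients for a positive example (y = 1) over a few logits.
for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(z, grad_correct(z, 1.0), grad_double_sigmoid(z, 1.0))
```

For a badly wrong prediction (z = -6 with y = 1), the correct gradient is close to -1, while the double-sigmoid gradient is several hundred times smaller. Note also that sigmoid(sigmoid(z)) always lies between 0.5 and about 0.731, so a model with the extra sigmoid cannot express a confident prediction for either class.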
Upvotes: 1