Reputation: 71
I have a custom estimator and am trying to use some custom metrics during evaluation. However, whenever I add these metrics to evaluation via eval_metric_ops, evaluation becomes really slow (much slower than training, which actually calculates the same metrics). If I don't add the metrics there, then I can only see metrics in TensorBoard for training, not for evaluation.
What is the right way to add a custom metric to a custom estimator so that it is saved during evaluation?
This is what I have:
def compute_accuracy(preds, labels):
    total = tf.shape(labels.values)[0]
    preds = tf.sparse_to_dense(preds.indices, preds.dense_shape, preds.values, default_value=-1)
    labels = tf.sparse_to_dense(labels.indices, labels.dense_shape, labels.values, default_value=-2)
    r = tf.shape(labels)[0]
    c = tf.minimum(tf.shape(labels)[1], tf.shape(preds)[1])
    preds = tf.slice(preds, [0, 0], [r, c])
    labels = tf.slice(labels, [0, 0], [r, c])
    preds = tf.cast(preds, tf.int32)
    labels = tf.cast(labels, tf.int32)
    correct = tf.reduce_sum(tf.cast(tf.equal(preds, labels), tf.int32))
    accuracy = tf.divide(correct, total)
    return accuracy
In model_fn:
edit_dist = tf.reduce_mean(tf.edit_distance(tf.cast(predicted_label[0], tf.int32), labels))
accuracy = compute_accuracy(predicted_label[0], labels)

tf.summary.scalar('edit_dist', edit_dist)
tf.summary.scalar('accuracy', accuracy)

metrics = {
    'accuracy': tf.metrics.mean(accuracy),
    'edit_dist': tf.metrics.mean(edit_dist),
}

if mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)
As requested, here is the complete model and TFRecord writer code:
def crnn_model(features, labels, mode, params):
    inputs = features['image']
    print("INPUTS SHAPE", inputs.shape)

    if mode == tf.estimator.ModeKeys.TRAIN:
        batch_size = params['batch_size']
        lr_initial = params['lr']
        lr = tf.train.exponential_decay(lr_initial, global_step=tf.train.get_global_step(),
                                        decay_steps=params['lr_decay_steps'],
                                        decay_rate=params['lr_decay_rate'],
                                        staircase=True)
        tf.summary.scalar('lr', lr)
    else:
        batch_size = params['test_batch_size']

    with tf.variable_scope('crnn', reuse=False):
        rnn_output, predicted_label, logits = CRNN(inputs, hidden_size=params['hidden_size'],
                                                   batch_size=batch_size)

    if mode == tf.estimator.ModeKeys.PREDICT:
        predictions = {
            'predicted_label': predicted_label,
            'logits': logits,
        }
        return tf.estimator.EstimatorSpec(mode, predictions=predictions)

    loss = tf.reduce_mean(tf.nn.ctc_loss(labels=labels, inputs=rnn_output,
                                         sequence_length=23 * np.ones(batch_size),
                                         ignore_longer_outputs_than_inputs=True))
    edit_dist = tf.reduce_mean(tf.edit_distance(tf.cast(predicted_label[0], tf.int32), labels))
    accuracy = compute_accuracy(predicted_label[0], labels)

    metrics = {
        'accuracy': tf.metrics.mean(accuracy),
        'edit_dist': tf.metrics.mean(edit_dist),
    }

    tf.summary.scalar('loss', loss)
    tf.summary.scalar('edit_dist', edit_dist)
    tf.summary.scalar('accuracy', accuracy)

    if mode == tf.estimator.ModeKeys.EVAL:
        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)

    assert mode == tf.estimator.ModeKeys.TRAIN
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        optimizer = tf.train.AdadeltaOptimizer(learning_rate=lr)
        train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())

    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
TFRecord writer code:
def _write_fn(self, out_file, image_list, label_list, mode):
    writer = tf.python_io.TFRecordWriter(out_file)
    N = len(image_list)
    for i in range(N):
        if (i % 1000) == 0:
            print('%s Data: %d/%d records saved' % (mode, i, N))
            sys.stdout.flush()
        try:
            #print('Try image: ', image_list[i])
            image = load_image(image_list[i])
        except (ValueError, AttributeError):
            print('Ignoring image: ', image_list[i])
            continue
        label = label_list[i]
        feature = {
            'label': _int64_feature(label),
            'image': _byte_feature(tf.compat.as_bytes(image.tostring()))
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())
    writer.close()
Upvotes: 2
Views: 1969
Reputation: 4000
In the Estimator framework, everything happens in the model_fn, namely your crnn_model(features, labels, mode, params). This is why this function has such a complex signature.
The mode parameter indicates whether it is called for training, evaluation or prediction. So, if you want to log additional summaries to TensorBoard during evaluation, you would add them under the if mode == tf.estimator.ModeKeys.EVAL section, or outside any if in the model_fn.
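As a minimal sketch of that layout (a toy regression model just to show where the summaries and the eval_metric_ops go; the names here are made up, it is not your CRNN):
import tensorflow as tf

def toy_model_fn(features, labels, mode, params):
    # Tiny linear model, only to illustrate the mode-based branching.
    preds = tf.layers.dense(features['x'], 1)
    loss = tf.losses.mean_squared_error(labels=labels, predictions=preds)

    # Summaries created outside any `if mode == ...` are written during
    # training by the default summary hook.
    tf.summary.scalar('mse', loss)

    if mode == tf.estimator.ModeKeys.EVAL:
        # During evaluation the Estimator records what is listed in
        # eval_metric_ops, so each metric must be a tf.metrics.* streaming op.
        metrics = {'mse_metric': tf.metrics.mean_squared_error(labels, preds)}
        return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)

    optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)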
At first I supposed your eval was much slower because you had different batch sizes for train/eval and the eval batch size could be smaller, but you indicated this is not the case.
After a closer look at your code, and having experimented with a similar model, I believe the evaluation takes longer with metrics because one of them is edit_distance(), which is implemented sequentially on the CPU. During training this op is not required, so it is not run.
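If you want to be certain the training side never even builds those ops, you could also move them entirely under the EVAL branch, roughly like this (a sketch based on your model_fn, reusing your predicted_label, labels, loss and compute_accuracy):
if mode == tf.estimator.ModeKeys.EVAL:
    # Build the CPU-bound edit-distance op only when it is actually needed.
    edit_dist = tf.reduce_mean(tf.edit_distance(tf.cast(predicted_label[0], tf.int32), labels))
    accuracy = compute_accuracy(predicted_label[0], labels)
    metrics = {
        'accuracy': tf.metrics.mean(accuracy),
        'edit_dist': tf.metrics.mean(edit_dist),
    }
    return tf.estimator.EstimatorSpec(mode, loss=loss, eval_metric_ops=metrics)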
What I suggest is that you run your train() and evaluate() in different programs, with the same model_fn() and model_dir. This way, train does not need to wait for evaluate, and evaluate will run only when necessary, i.e. when there are new checkpoints in the model_dir. If you don't have 2 GPUs for this, you can either split the GPU memory between the two processes (using a custom run config with gpu_memory_fraction=0.75 for train) or hide the GPU from evaluate() with the CUDA_VISIBLE_DEVICES='' environment variable.
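A rough sketch of that two-process setup (train_input_fn, eval_input_fn, params and 'model_dir' are placeholders for your own code, and the polling loop is just one way to re-run evaluation when a new checkpoint appears):
import os
import time
import tensorflow as tf

# train.py: cap the GPU memory so the evaluation process can share the card.
session_config = tf.ConfigProto(
    gpu_options=tf.GPUOptions(per_process_gpu_memory_fraction=0.75))
run_config = tf.estimator.RunConfig(model_dir='model_dir', session_config=session_config)
train_estimator = tf.estimator.Estimator(model_fn=crnn_model, config=run_config, params=params)
train_estimator.train(input_fn=train_input_fn)

# eval.py: hide the GPU and re-evaluate whenever a new checkpoint shows up
# in the shared model_dir.
os.environ['CUDA_VISIBLE_DEVICES'] = ''
eval_estimator = tf.estimator.Estimator(model_fn=crnn_model, model_dir='model_dir', params=params)
last_ckpt = None
while True:
    ckpt = tf.train.latest_checkpoint('model_dir')
    if ckpt is not None and ckpt != last_ckpt:
        eval_estimator.evaluate(input_fn=eval_input_fn)
        last_ckpt = ckpt
    time.sleep(60)  # poll for new checkpoints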
Upvotes: 2