Reputation: 25387
All I know is that the error occurs when this branch gets executed and the weights from it get passed down to tf.data.experimental.sample_from_datasets:
# ...
elif pretrain_cfg.schedule == PretrainSchedule.CONVERGE_LINEARLY:
    logger.info('[%s] - Pretrain: Using CONVERGE_LINEARLY schedule' % self.name)
    a = tf.minimum(tf.constant(1.0, dtype=tf.float64, shape=(1,)), global_step / max_pretrain_steps)
    b = tf.maximum(tf.constant(0.0, dtype=tf.float64, shape=(1,)), 1 - global_step / max_pretrain_steps)
    weights = a * const_task_weights + b * pretrain_task_weights
    return tf.data.experimental.sample_from_datasets(datasets, weights=weights)
The following works:
weights = tf.cond(
    tf.greater(global_step, max_pretrain_steps),
    true_fn=lambda: const_task_weights,
    false_fn=lambda: pretrain_task_weights
)
but for some reason this causes the SIGSEGV:
a = tf.minimum(tf.constant(1.0, dtype=tf.float64, shape=(1,)), global_step / max_pretrain_steps)
b = tf.maximum(tf.constant(0.0, dtype=tf.float64, shape=(1,)), 1 - global_step / max_pretrain_steps)
weights = a * const_task_weights + b * pretrain_task_weights
I don't really see what the problem is, but it definitely comes from this line:
weights = a * const_task_weights + b * pretrain_task_weights
The question is why. It might not be valid to have a dependency on global_step in this context, i.e. in the weights parameter of sample_from_datasets.
However, inside sample_from_datasets itself I don't see anything suspicious, since the first thing that happens is
weights = ops.convert_to_tensor(weights, name="weights")
So passing a tensor to it should be fine.
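For context, here is a reduced, self-contained version of what I am trying to do (TF 1.x graph mode; the dummy datasets, the step limit and the weight values are made up for illustration):
import tensorflow as tf

# Two dummy datasets standing in for the real task datasets.
datasets = [tf.data.Dataset.from_tensors(0).repeat(),
            tf.data.Dataset.from_tensors(1).repeat()]

global_step = tf.train.get_or_create_global_step()
max_pretrain_steps = tf.constant(1000.0, dtype=tf.float64)

# One weight per dataset - both vectors share the shape (2,).
const_task_weights = tf.constant([0.5, 0.5], dtype=tf.float64)
pretrain_task_weights = tf.constant([0.9, 0.1], dtype=tf.float64)

# Linear interpolation between the two weight vectors over the first
# max_pretrain_steps steps, clipped to [0, 1].
frac = tf.cast(global_step, tf.float64) / max_pretrain_steps
a = tf.minimum(tf.constant(1.0, dtype=tf.float64), frac)
b = tf.maximum(tf.constant(0.0, dtype=tf.float64), 1.0 - frac)
weights = a * const_task_weights + b * pretrain_task_weights  # still shape (2,)

mixed = tf.data.experimental.sample_from_datasets(datasets, weights=weights)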
Any ideas?
Error output:
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /data/translation/multi-problem/hi2en/model/512-3-1-1024/de2en.hi2en/c19cfad259cad911/model.ckpt.
bash: line 1: 4153 Segmentation fault (core dumped) env "CUDA_VISIBLE_DEVICES"="0" "LIBRARY_ROOTS"="/Users/username/Library/Caches/PyCharm2018.2/remote_sou...
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
Upvotes: 0
Views: 752
Reputation: 25387
Okay, it turns out the problem is not with tensorflow directly - actually not at all. The problem was that const_task_weights and pretrain_task_weights did not have the same shape: I did not validate the input and had a bug somewhere else.
Just be aware that you might get this kind of error if the shapes do not match. I guess this cannot be checked or determined by tensorflow, so it is something the user has to take care of (citation needed).
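A simple check along these lines would have caught it early (just a sketch; the function name and error messages are made up, while const_task_weights, pretrain_task_weights and datasets are the objects from the question):
def check_task_weight_shapes(const_task_weights, pretrain_task_weights, datasets):
    # Fail early with a readable error instead of a segfault later on.
    const_shape = const_task_weights.shape
    pretrain_shape = pretrain_task_weights.shape
    if not const_shape.is_compatible_with(pretrain_shape):
        raise ValueError('Task weight shapes differ: %s vs %s'
                         % (const_shape, pretrain_shape))
    num_weights = const_shape.num_elements()
    if num_weights is not None and num_weights != len(datasets):
        raise ValueError('Expected %d weights (one per dataset), got %d'
                         % (len(datasets), num_weights))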
Upvotes: 1