Stefan Falk

Reputation: 25387

Getting interrupted by signal 11: SIGSEGV

All I know is that the error occurs when this branch gets executed and the weights it produces get passed down to tf.data.experimental.sample_from_datasets:

# ...
elif pretrain_cfg.schedule == PretrainSchedule.CONVERGE_LINEARLY:
    logger.info('[%s] - Pretrain: Using CONVERGE_LINEARLY schedule' % self.name)
    a = tf.minimum(tf.constant(1.0, dtype=tf.float64, shape=(1,)), global_step / max_pretrain_steps)
    b = tf.maximum(tf.constant(0.0, dtype=tf.float64, shape=(1,)), 1 - global_step / max_pretrain_steps)
    weights = a * const_task_weights + b * pretrain_task_weights

return tf.data.experimental.sample_from_datasets(datasets, weights=weights)

The following works:

weights = tf.cond(
    tf.greater(global_step, max_pretrain_steps),
    true_fn=lambda: const_task_weights,
    false_fn=lambda: pretrain_task_weights
)

but for some reason this causes the SIGSEGV:

a = tf.minimum(tf.constant(1.0, dtype=tf.float64, shape=(1,)), global_step / max_pretrain_steps)
b = tf.maximum(tf.constant(0.0, dtype=tf.float64, shape=(1,)), 1 - global_step / max_pretrain_steps)
weights = a * const_task_weights + b * pretrain_task_weights

I don't really see what the problem is, but it definitely comes from this line:

weights = a * const_task_weights + b * pretrain_task_weights

The question is why. It might simply not be valid to have a dependency on the global_step in this context, i.e. in the weights parameter of sample_from_datasets.

However, I don't see anything suspicious in sample_from_datasets, since the first thing it does is

weights = ops.convert_to_tensor(weights, name="weights")

So passing a tensor to it should be fine.
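For reference, a stripped-down, self-contained sketch of what I am trying to do looks roughly like this (the dataset contents and the concrete weight values are placeholders; I am on TF 1.x graph mode):

import tensorflow as tf

# Two toy datasets standing in for the real tasks.
datasets = [
    tf.data.Dataset.from_tensor_slices(tf.zeros([100])),  # e.g. the "const" task
    tf.data.Dataset.from_tensor_slices(tf.ones([100])),   # e.g. the pretrain task
]

global_step = tf.train.get_or_create_global_step()
max_pretrain_steps = tf.constant(10000, dtype=tf.int64)

# One weight per dataset, i.e. both vectors have shape (2,) here.
const_task_weights = tf.constant([0.5, 0.5], dtype=tf.float64)
pretrain_task_weights = tf.constant([0.1, 0.9], dtype=tf.float64)

# Linearly interpolate from the pretrain weights to the const weights.
progress = tf.cast(global_step, tf.float64) / tf.cast(max_pretrain_steps, tf.float64)
a = tf.minimum(tf.constant(1.0, dtype=tf.float64), progress)
b = tf.maximum(tf.constant(0.0, dtype=tf.float64), 1.0 - progress)
weights = a * const_task_weights + b * pretrain_task_weights

dataset = tf.data.experimental.sample_from_datasets(datasets, weights=weights)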

Any ideas?


Error output:

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /data/translation/multi-problem/hi2en/model/512-3-1-1024/de2en.hi2en/c19cfad259cad911/model.ckpt.
bash: line 1:  4153 Segmentation fault      (core dumped) env "CUDA_VISIBLE_DEVICES"="0" "LIBRARY_ROOTS"="/Users/username/Library/Caches/PyCharm2018.2/remote_sou...

Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Upvotes: 0

Views: 752

Answers (1)

Stefan Falk

Reputation: 25387

Okay, it turns out the problem is not with TensorFlow directly - actually not at all.

The problem was that const_task_weights and pretrain_task_weights did not have the same shape. I did not validate the input and had a bug somewhere else.

Just be aware that you might get this kind of error if the shapes do not match.

I guess this cannot be checked or determined by TensorFlow, so it is something the user has to take care of (citation needed).
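A simple sanity check before combining the weights would have caught my bug early. A rough sketch (the names mirror the ones in my question):

# Both weight vectors need one entry per dataset, i.e. identical static shapes.
const_task_weights.shape.assert_is_compatible_with(pretrain_task_weights.shape)
assert len(datasets) == const_task_weights.shape[0], \
    "Expected one weight per dataset, got %s weights for %d datasets" % (
        const_task_weights.shape[0], len(datasets))

weights = a * const_task_weights + b * pretrain_task_weights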

Upvotes: 1
