TGD

Reputation: 56

Segmentation Fault During Validation with MirroredStrategy on Multiple GPUs

I am training a model with TensorFlow 2.18.0 using tf.distribute.MirroredStrategy across two GPUs. Training works fine on a single GPU, but as soon as I run it on two GPUs, the process crashes with a segmentation fault during validation.

Here is a snippet of my code:

from config import MainConfig
from dataset import dataset
from model2 import build_tf_model
from utils import CustomModelCheckpoint, get_lr_callback
import tensorflow as tf

# Save the best checkpoint (by validation accuracy) starting from epoch 5
checkpoint_callback_val = CustomModelCheckpoint(
    "models/val_model_{epoch:02d}_{val_acc_l:.1f}.keras",
    monitor="val_acc_l",
    save_best_only=True,
    mode="max",
    verbose=0,
    start_epoch=5
)

# Enable on-demand GPU memory allocation before any GPUs are initialized
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

input_shape = (MainConfig.sequence_length, MainConfig.features)
train_sequences, train_labels, validation_sequences, validation_labels = dataset() # numpy arrays of shape (samples, sequences, features)
strategy = tf.distribute.MirroredStrategy(devices=["/GPU:0", "/GPU:1"])

with strategy.scope():
    # Build the model inside the strategy scope so its variables are mirrored on both GPUs
    model = build_tf_model(input_shape)
    model.fit(train_sequences, train_labels,
        validation_data=(validation_sequences, validation_labels),
        epochs=MainConfig.epochs,
        shuffle=True,
        batch_size=MainConfig.train_batch_size,
        callbacks=[checkpoint_callback_val, get_lr_callback()]
    )

I have tried the following for the validation dataset:

- Passing validation_sequences and validation_labels as NumPy arrays
- Using a tf.data.Dataset object
- Using a distributed dataset created with strategy.experimental_distribute_dataset

For reference, the second and third attempts looked roughly like this (reconstructed from memory, so the exact pipeline options may differ slightly):
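# Attempt 2: wrap the NumPy validation arrays in a tf.data.Dataset
val_ds = tf.data.Dataset.from_tensor_slices(
    (validation_sequences, validation_labels)
).batch(MainConfig.train_batch_size)

# Attempt 3: distribute the dataset across the replicas explicitly
dist_val_ds = strategy.experimental_distribute_dataset(val_ds)

Regardless of these attempts, the segmentation fault persists when using two GPUs. Here's the stack trace: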

File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/gen_experimental_dataset_ops.py", line 335 in auto_shard_dataset
File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/data/experimental/ops/distribute.py", line 74 in __init__
File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/input_ops.py", line 56 in auto_shard_dataset
File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/input_lib.py", line 919 in _create_cloned_datasets_from_dataset
File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/input_lib.py", line 834 in build
File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/input_lib.py", line 804 in __init__
File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/input_util.py", line 65 in get_distributed_dataset
File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/mirrored_strategy.py", line 592 in _experimental_distribute_dataset
File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 1468 in experimental_distribute_dataset
File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 668 in __init__
File "/usr/local/lib/python3.11/dist-packages/keras/src/backend/tensorflow/trainer.py", line 334 in fit
File "/usr/local/lib/python3.11/dist-packages/keras/src/utils/traceback_utils.py", line 117 in error_handler
File "/mnt/train/train2.py", line 28 in <module>

Has anyone encountered this issue, or does anyone have insight into why it happens only with two GPUs? Are there any specific considerations or configurations needed for validation datasets when using MirroredStrategy across multiple GPUs?
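For what it's worth, the stack trace points at auto_shard_dataset, so I suspect the automatic sharding of the validation data across the two replicas may be involved. Below is how I would expect to override the sharding policy on the tf.data attempt; this is just a guess on my part, not something I have confirmed fixes the crash:

options = tf.data.Options()
# Shard by slicing each batch across replicas instead of splitting input files
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
val_ds = val_ds.with_options(options)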

Thank you in advance for your help!

Upvotes: 1

Views: 47

Answers (0)
