re4per

YOLOv9e-seg training on 6 A100-80G GPUs: after optimizing as much as I could, I still get a CUDA out of memory error at the validation stage

I am trying to train a YOLOv9e-seg model on 336 images of size 4096x4096, split into train and val sets in an 80:20 ratio. Previously I was getting this error during training as well, but after tuning some of the train method's parameters I was able to get past it. In an older version of my code validation would sometimes complete and the error only appeared at a later step, but in the current version the program fails during validation with a "torch.OutOfMemoryError: CUDA out of memory" error. The training code is below:

import os
import torch
import atexit
import gc
from ultralytics import YOLO
from torch.nn import DataParallel

# Remap GPUs to a contiguous set using CUDA_VISIBLE_DEVICES.
# For example, if you want to use physical GPUs 0, 1, 3, 4, 5, 6:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,3,4,5,6"

# Set environment variable to help reduce memory fragmentation.
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

# Function to clear GPU memory.
def clear_gpu_memory():
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

# Ensure that GPU memory is cleared on exit.
atexit.register(clear_gpu_memory)

# Load the pretrained YOLOv9 segmentation model and compile it.
model = YOLO("yolov9e-seg.pt")
model.model = torch.compile(model.model)

try:
    # Train the model with your specified parameters.
    model.train(
        data='training_data/brain_data.yaml',
        epochs=2,
        imgsz=4096,
        batch=6,
        project='brain_segmentation',
        name='testrun',
        device=[0, 1, 3, 4, 5, 6],
        close_mosaic=1,
        save_period=1,
        amp=True,
        cache=False,
        overlap_mask=False,
        workers=4,
    )

    # If available, try deleting the optimizer to free memory.
    try:
        del model.optimizer
    except AttributeError:
        pass

    # Force garbage collection and clear cached GPU memory after training.
    gc.collect()
    torch.cuda.empty_cache()

    # Get the number of GPUs now visible (they are renumbered from 0 to N-1).
    available_gpus = torch.cuda.device_count()
    print(f"Available GPUs (contiguous numbering): {list(range(available_gpus))}")

    # Wrap the model in DataParallel for multi-GPU use before validation.
    model.model = DataParallel(model.model, device_ids=list(range(available_gpus)))
    model.model.to('cuda')

    # --- Before validation, unwrap and fuse the model ---
    # The fused model is expected to be used on a single device, so we unwrap the DataParallel container.
    if isinstance(model.model, DataParallel):
        # Unwrap and call the underlying fuse() method.
        fused_module = model.model.module.fuse(verbose=False)
        model.model = fused_module
    else:
        model.model = model.model.fuse(verbose=False)
    
    print("Model fused.")

    # Validate using memory optimizations:
    # - torch.inference_mode() to disable gradient tracking.
    # - torch.amp.autocast with device_type='cuda' for mixed-precision inference.
    with torch.inference_mode():
        with torch.amp.autocast(device_type='cuda'):
            model.val(
                device=list(range(available_gpus)),
                batch=6,
                imgsz=4096
            )
    
    print("Validation complete.")

    # Export the fused model to ONNX (typically done on a single GPU).
    model.export(
        device=0,
        imgsz=4096,
        half=True,
        simplify=True,
        opset=12
    )

except KeyboardInterrupt:
    print("Training interrupted. Clearing GPU memory...")
    clear_gpu_memory()
    raise

except Exception as e:
    print(f"An error occurred: {e}. Clearing GPU memory...")
    clear_gpu_memory()
    raise
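
In case it is relevant, here is a minimal, untested sketch of what I am considering trying next: running validation as a separate step from the saved weights, on a single GPU and with a smaller batch. The weights path is taken from the run output below; whether a smaller batch actually fits in 80 GB at imgsz=4096 is just an assumption on my part.

# Sketch (not verified): standalone validation from the best checkpoint of the
# finished run, on one GPU with a reduced batch size.
from ultralytics import YOLO

val_model = YOLO("brain_segmentation/testrun21/weights/best.pt")
val_model.val(
    data='training_data/brain_data.yaml',
    device=0,      # single GPU instead of the full device list
    batch=1,       # assumption: a batch of 1 may fit within 80 GB
    imgsz=4096,
    half=True,
)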

My config file is training_data/brain_data.yaml:

path: work_my/new_yolo_4096/training_data
train:
  - images/train  # Path to training images
  - labels/train  # Path to training annotations
val:
  - images/val  # Path to validation images
  - labels/val  # Path to validation annotations

nc: 25
names: ['Thalamus', 'Caudate nucleus', 'Putamen', 'Globus pallidus', 'Nucleus accumbens', 'Internal capsule', 'Substantia innominata', 'Fornix', 'Anterior commissure', 'Ganglionic eminence', 'Hypothalamus', 'Amygdala', 'Hippocampus', 'Choroid plexus', 'Lateral ventricle', 'Olfactory tubercle', 'Pretectum', 'Inferior colliculus', 'Superior colliculus', 'Tegmentum', 'Pons', 'Medulla', 'Cerebellum', 'Corpus callosum', 'Cerebral cortex']

Some points:

Training itself completes successfully, as shown in this part of the log:

Starting training for 2 epochs...

Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
1/2 81G 2.821 6.069 54.57 2.973 30 4096: 100%|██████████| 45/45 [00:56<00:00, 1.25s/it]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|██████████| 34/34 [00:27<00:00, 1.26it/s]
all 68 2181 0.00188 0.0077 0.00106 0.000624 0.000624 0.00377 0.000345 0.000176
Closing dataloader mosaic

Epoch GPU_mem box_loss seg_loss cls_loss dfl_loss Instances Size
2/2 60.8G 2.783 4.883 49.26 2.989 37 4096: 100%|██████████| 45/45 [00:49<00:00, 1.09s/it]
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 100%|██████████| 34/34 [00:27<00:00, 1.24it/s]
all 68 2181 0.00188 0.0077 0.00106 0.000624 0.000624 0.00377 0.000345 0.000176

2 epochs completed in 0.048 hours.
Optimizer stripped from brain_segmentation/testrun21/weights/last.pt, 124.0MB
Optimizer stripped from brain_segmentation/testrun21/weights/best.pt, 124.0MB

Then the error:

Results saved to brain_segmentation/testrun21
Ultralytics 8.3.74 🚀 Python-3.10.12 torch-2.6.0+cu124 CUDA:0 (NVIDIA A100-SXM4-80GB, 81051MiB)
CUDA:1 (NVIDIA A100-SXM4-80GB, 81051MiB)
CUDA:3 (NVIDIA A100-SXM4-80GB, 81051MiB)
CUDA:4 (NVIDIA A100-SXM4-80GB, 81051MiB)
CUDA:5 (NVIDIA A100-SXM4-80GB, 81051MiB)
CUDA:6 (NVIDIA A100-SXM4-80GB, 81051MiB)
YOLOv9e-seg summary (fused): 714 layers, 59,700,955 parameters, 0 gradients, 244.5 GFLOPs
val: Scanning /storage/lab_user/work_my/new_yolo_4096/training_data/labels/val.cache... 68 images, 0 backgrounds, 0 corrupt: 100%|██████████| 68/68 [00:00<?, ?it/s]
val: WARNING ⚠️ /storage/lab_user/work_my/new_yolo_4096/training_data/images/val/222_574.jpg: 1 duplicate labels removed
Class Images Instances Box(P R mAP50 mAP50-95) Mask(P R mAP50 mAP50-95): 0%| | 0/12 [00:04<?, ?it/s]
An error occurred: CUDA out of memory. Tried to allocate 36.57 GiB. GPU 0 has a total capacity of 79.15 GiB of which 7.22 GiB is free. Process 2699867 has 71.91 GiB memory in use. Of the allocated memory 71.28 GiB is allocated by PyTorch, and 120.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables). Clearing GPU memory...
Traceback (most recent call last):
File "/storage/lab_user/work_my/new_yolo_4096/bg_run.py", line 40, in <module>
model.val(
File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/ultralytics/engine/model.py", line 640, in val
validator(model=self.model)
File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/ultralytics/engine/validator.py", line 182, in __call__
preds = model(batch["img"], augment=augment)
File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/ultralytics/nn/autobackend.py", line 555, in forward
y = self.model(im, augment=augment, visualize=visualize, embed=embed)
File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)

File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/ultralytics/nn/tasks.py", line 109, in forward
return self.predict(x, *args, **kwargs)
File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/ultralytics/nn/tasks.py", line 127, in predict
return self._predict_once(x, profile, visualize, embed)
File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/ultralytics/nn/tasks.py", line 148, in _predict_once
x = m(x) # run
File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
return forward_call(*args, **kwargs)
File "/storage/lab_user/work_my/new_yolo_4096/4096env_yolo/lib/python3.10/site-packages/ultralytics/nn/modules/block.py", line 701, in forward
return torch.sum(torch.stack(res + xs[-1:]), dim=0)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 36.57 GiB. GPU 0 has a total capacity of 79.15 GiB of which 7.22 GiB is free. Process 2699867 has 71.91 GiB memory in use. Of the allocated memory 71.28 GiB is allocated by PyTorch, and 120.91 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
terminate called without an active exception
Aborted (core dumped)
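
For reference, a single 4096x4096x3 fp32 image is only about 0.2 GiB (roughly 1.1 GiB for a batch of 6), so the 36.57 GiB allocation must come from intermediate activations rather than from the input batch itself. Below is a small snippet (plain torch.cuda calls, nothing Ultralytics-specific) that I can drop in right before model.val() to log per-GPU memory use; log_gpu_memory is just a hypothetical helper name.

# Report allocated/reserved memory for each visible GPU (values in GiB).
import torch

def log_gpu_memory(tag=""):
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 1024**3
        reserved = torch.cuda.memory_reserved(i) / 1024**3
        print(f"[{tag}] GPU {i}: allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")

log_gpu_memory("before val")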
