林芷翎

Reputation: 1

OutOfMemoryError with PatchCore Training on 23.67 GiB GPU

I’m training a PatchCore model with an image size of 128x512 on a GPU with 23.67 GiB of memory, but training keeps failing with the following error:

CUDA Version: 12.4
PyTorch Version: 2.5.1

OutOfMemoryError: CUDA out of memory. Tried to allocate 2.17 GiB. GPU 0 has a total capacity of 23.67 GiB of which 47.88 MiB is free. Including non-PyTorch memory, this process has 23.62 GiB memory in use. Of the allocated memory 23.29 GiB is allocated by PyTorch, and 15.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management.

Configuration (yaml):

data:
  class_path: anomalib.data.Folder
  init_args:
    name: train_data
    root: ""
    image_size:
      - 128
      - 512
    normal_dir: ""
    abnormal_dir: ""
    normal_test_dir: ""
    mask_dir: ""
    normal_split_ratio: 0
    extensions: [".png"]
    train_batch_size: 4
    eval_batch_size: 4
    num_workers: 8
    train_transform:
      class_path: torchvision.transforms.v2.Compose
      init_args:
        transforms:
          - class_path: torchvision.transforms.v2.RandomAdjustSharpness
            init_args:
              sharpness_factor: 0.7
              p: 0.5
          - class_path: torchvision.transforms.v2.RandomHorizontalFlip
            init_args:
              p: 0.5
          - class_path: torchvision.transforms.v2.Resize
            init_args:
              size: [128, 512]
          - class_path: torchvision.transforms.v2.Normalize
            init_args:
              mean: [0.485, 0.456, 0.406]
              std: [0.229, 0.224, 0.225]
    eval_transform:
      class_path: torchvision.transforms.v2.Compose
      init_args:
        transforms:
          - class_path: torchvision.transforms.v2.Resize
            init_args:
              size: [128, 512]
          - class_path: torchvision.transforms.v2.Normalize
            init_args:
              mean: [0.485, 0.456, 0.406]
              std: [0.229, 0.224, 0.225]

model:
  class_path: anomalib.models.Patchcore
  init_args:
    backbone: wide_resnet50_2
    layers:
      - layer2
      - layer3
    pre_trained: true
    coreset_sampling_ratio: 0.1
    num_neighbors: 9

Steps I’ve Tried:

Lowering the batch size: I reduced the batch size to as low as 1, but the issue persists.

Checking for memory fragmentation: I followed the suggestion in the error message and set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (see the snippet after this list), but it did not solve the problem.

Ensuring no memory leakage: I verified with nvidia-smi that no other processes are consuming GPU memory, yet the allocated memory still maxes out during training.
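
For completeness, this is how I set the allocator option (a minimal sketch; the variable has to be in the environment before the first CUDA allocation in the process):

import os

# Must be set before PyTorch touches the GPU, e.g. at the very top of the
# training script or exported in the shell that launches it.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"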

Questions:

Are there specific optimizations for PatchCore or PyTorch that can help reduce memory usage?

Upvotes: 0

Views: 25

Answers (1)

deep-learnt-nerd

Reputation: 189

Have you tried using mixed precision?

You can usually enable it by passing precision="16-mixed" to the Lightning trainer. anomalib seems to have implemented a way to use it during deployment as well.
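
For example (a minimal sketch, assuming anomalib's Engine forwards extra trainer keyword arguments such as precision to the underlying Lightning Trainer):

from anomalib.engine import Engine

# Assumption: extra keyword arguments are passed through to
# lightning.pytorch.Trainer, so precision can be set here directly.
# "16-mixed" runs the forward passes under float16 autocast, which can
# substantially reduce the memory used by the backbone's feature maps.
engine = Engine(precision="16-mixed")

# Training itself is unchanged:
# engine.fit(model=model, datamodule=datamodule)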

Upvotes: 0
