Why does Torch use ~700 MB of GPU memory when predicting with a 1.5 MB network?

Question

I am very new to Torch/CUDA, and I'm trying to test the small binary network (~1.5 MB) from https://github.com/1adrianb/binary-face-alignment, but I keep running into 'out of memory' errors.

I am using a relatively weak GPU (NVIDIA Quadro K600) with ~900 MB of graphics memory on Ubuntu 16.04 with CUDA 10.0 and cuDNN 5.1. I don't really care about performance, but I thought I would at least be able to run a small network for prediction, one image at a time (especially one that is supposedly aimed at those "with Limited Resources").

I managed to run the code in headless mode and measured the memory consumption at around 700 MB. That would explain why it fails immediately when an X server is running, since the X server alone takes around 250 MB of GPU memory.

I also added some logs to see how far along main.lua I get, and it's the call output:copy(model:forward(img)) on the very first image that runs out of memory.

For reference, here's the main.lua code up until the crash:

    require 'torch'
    require 'nn'
    require 'cudnn'
    require 'paths'

    require 'bnn'
    require 'optim'

    require 'gnuplot'
    require 'image'
    require 'xlua'
    local utils = require 'utils'
    local opts = require('opts')(arg)

    print("Starting heap tracking")
    torch.setheaptracking(true)

    torch.setdefaulttensortype('torch.FloatTensor')
    torch.setnumthreads(1)
    -- torch.

    local model
    if opts.dataset == 'AFLWPIFA' then
        print('Not available for the moment. Support will be added soon')
        os.exit()
        model = torch.load('models/facealignment_binary_pifa.t7')
    else
        print("Loading model")
        model = torch.load('models/facealignment_binary_aflw.t7')
    end
    model:evaluate()

    local fileLists = utils.getFileList(opts)
    local predictions = {}
    local noPoints = 68
    if opts.dataset == 'AFLWPIFA' then noPoints = 34; end
    local output = torch.CudaTensor(1,noPoints,64,64)
    for i = 1, #fileLists do

        local img = image.load(fileLists[i].image)
        local originalSize = img:size()

        img = utils.crop(img, fileLists[i].center, fileLists[i].scale, 256)
        img = img:cuda():view(1,3,256,256)
        output:copy(model:forward(img))

So I have two major questions:

1. What tools are there for debugging memory usage in Torch?
2. What are the plausible causes of this memory bloat?

It must be something more than just the network and the images that are loaded onto the GPU. My best guess is that it's related to the LoadFileLists function, but I simply don't know enough Torch or Lua to go much further from there. Other answers indicate that there isn't really any support for showing how much memory a variable is taking.

Berriel · Accepted Answer

The activation maps (and the gradients, when training) usually consume most of the memory. I am not familiar with this particular model and implementation, but I would say that you are using a "fake" binary network. By fake I mean that it still uses floating-point numbers to represent the binary values, since most users are going to run the code on GPUs that do not fully support real binary operations. The authors even write in Section 5:

Performance. In theory, by replacing all floating-point multiplications with bitwise XOR and making use of the SWAR (Single instruction, multiple data within a register) [5], [6], the number of operations can be reduced up to 32x when compared against the multiplication-based convolution. However, in our tests, we observed speedups of up to 3.5x, when compared against cuBLAS, for matrix multiplications, a result being in accordance with those reported in [6]. We note that we did not conduct experiments on CPUs. However, given the fact that we used the same method for binarization as in [5], similar improvements in terms of speed, of the order of 58x, are to be expected: as the real-valued network takes 0.67 seconds to do a forward pass on a i7-3820 using a single core, a speedup close to x58 will allow the system to run in real-time. In terms of memory compression, by removing the biases, which have minimum impact (or no impact at all) on performance, and by grouping and storing every 32 weights in one variable, we can achieve a compression rate of 39x when compared against the single precision counterpart of Torch.

In this context, a small model (w.r.t. the number of parameters or the model size in MiB) does not necessarily mean a low memory footprint. It is likely that most of this memory is being used to store the activation maps in single or double precision.
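To make the accounting concrete, here is a minimal sketch of how activation memory adds up. The input and output shapes come from main.lua in the question; the 256-channel 64x64 feature map is a hypothetical intermediate shape chosen only for illustration, not something read from the model file. In Torch's nn package each module keeps its own output tensor, so a deep network holds many such maps at once. The last part assumes cutorch is installed and uses cutorch.getMemoryUsage to print the free and total device memory, which you could call before and after model:forward to see the actual jump; it is skipped when no GPU is available.

```lua
-- Bytes occupied by a dense tensor with the given dimensions.
-- bytesPerElement defaults to 4 (single precision, as in torch.CudaTensor).
local function tensorBytes(dims, bytesPerElement)
    local count = 1
    for _, d in ipairs(dims) do
        count = count * d
    end
    return count * (bytesPerElement or 4)
end

local MiB = 1024 * 1024

-- Shapes taken from main.lua in the question.
local inputBytes  = tensorBytes({1, 3, 256, 256})
local outputBytes = tensorBytes({1, 68, 64, 64})

-- ASSUMPTION: a hypothetical intermediate activation map; the real
-- channel counts and resolutions depend on the network architecture.
local featureMapBytes = tensorBytes({1, 256, 64, 64})

print(string.format('input:              %.2f MiB', inputBytes / MiB))
print(string.format('output:             %.2f MiB', outputBytes / MiB))
print(string.format('one feature map:    %.2f MiB', featureMapBytes / MiB))
print(string.format('100 such maps:      %.2f MiB', 100 * featureMapBytes / MiB))

-- Optional measurement of actual device memory, guarded so the script
-- still runs on a machine without cutorch or without a GPU.
local ok = pcall(function()
    local cutorch = require 'cutorch'
    local free, total = cutorch.getMemoryUsage(cutorch.getDevice())
    print(string.format('GPU memory: %.0f MiB free of %.0f MiB',
                        free / MiB, total / MiB))
end)
if not ok then
    print('cutorch not available; skipping GPU measurement')
end
```

A single float feature map of that assumed size is already larger than the whole 1.5 MB model file, which is why the parameter count says little about the footprint at prediction time. The CUDA context and any cuDNN workspaces come on top of the activations, so the measured number will be higher than an estimate like this.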
