Reputation: 48576
I'm trying to get a rough handle on the GPU memory footprint of my TensorFlow deep learning models, and am relying on a heuristic I've found, which suggests:
The largest bottleneck to be aware of when constructing ConvNet architectures is the memory bottleneck. Many modern GPUs have a limit of 3/4/6GB memory, with the best GPUs having about 12GB of memory. There are three major sources of memory to keep track of:
From the intermediate volume sizes: These are the raw number of activations at every layer of the ConvNet, and also their gradients (of equal size). Usually, most of the activations are on the earlier layers of a ConvNet (i.e. first Conv Layers). These are kept around because they are needed for backpropagation, but a clever implementation that runs a ConvNet only at test time could in principle reduce this by a huge amount, by only storing the current activations at any layer and discarding the previous activations on layers below.
From the parameter sizes: These are the numbers that hold the network parameters, their gradients during backpropagation, and commonly also a step cache if the optimization is using momentum, Adagrad, or RMSProp. Therefore, the memory to store the parameter vector alone must usually be multiplied by a factor of at least 3 or so.
Every ConvNet implementation has to maintain miscellaneous memory, such as the image data batches, perhaps their augmented versions, etc.
Once you have a rough estimate of the total number of values (for activations, gradients, and misc), the number should be converted to size in GB. Take the number of values, multiply by 4 to get the raw number of bytes (since every floating point is 4 bytes, or maybe by 8 for double precision), and then divide by 1024 multiple times to get the amount of memory in KB, MB, and finally GB. If your network doesn’t fit, a common heuristic to “make it fit” is to decrease the batch size, since most of the memory is usually consumed by the activations.
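For example, here is the kind of back-of-the-envelope calculation I understand the heuristic to describe (the layer sizes and parameter count below are made up purely to illustrate the arithmetic):

```python
# Back-of-the-envelope memory estimate following the quoted heuristic.
# All sizes below are invented for illustration only.
batch_size = 64
activation_counts = [64 * 224 * 224, 128 * 112 * 112, 256 * 56 * 56]  # values per example, per layer
param_count = 5_000_000                                               # total trainable parameters

activations = batch_size * sum(activation_counts)   # forward activations
gradients = activations                              # their gradients (equal size, at training time)
params = 3 * param_count                             # parameters + their gradients + optimizer step cache

total_values = activations + gradients + params
total_bytes = total_values * 4                       # float32 -> 4 bytes per value
print(f"~{total_bytes / 1024 / 1024 / 1024:.2f} GB")
```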
But I'm unsure of a few things:
Upvotes: 1
Views: 1270
Reputation: 27050
You have a single model that is trained using batches of samples.
A single batch is composed of multiple inputs.
These inputs are processed in parallel by the model.
Thus, if your batch contains a certain number of elements, every element is transferred from the CPU (where the input Queues are) to the GPU.
The GPU then computes the forward pass for every single element of the input batch, using the model in its state at time step t (think of the model with its parameters frozen at step t).
Then the network outputs are accumulated in a vector and the backpropagation step is computed.
The gradients are thus calculated (backward pass) for every single element of the batch, again using the model at time step t, accumulated in a vector and averaged.
Using this average, the model parameters are updated and the model enters the state t+1.
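As a minimal illustration of that training step, here is a sketch using the modern tf.GradientTape API (the model, batch, and optimizer below are arbitrary placeholders): a forward pass with the parameters at state t, gradients averaged over the batch, and the update that moves the model to state t+1.

```python
import tensorflow as tf

# Hypothetical tiny model and synthetic batch, just to show one training step.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

images = tf.random.uniform((64, 32, 32, 3))                   # one batch, copied to the GPU
labels = tf.random.uniform((64,), maxval=10, dtype=tf.int32)

with tf.GradientTape() as tape:
    logits = model(images, training=True)   # forward pass with parameters at state t
    loss = loss_fn(labels, logits)          # per-element losses reduced to a batch mean
grads = tape.gradient(loss, model.trainable_variables)            # backward pass, averaged over the batch
optimizer.apply_gradients(zip(grads, model.trainable_variables))  # model moves to state t+1
```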
As a rule of thumb, everything that is sequential by nature stays on the CPU (think of input threads, queues, processing of single input values, ...). However, everything that the network should process is then transferred from the CPU to the GPU.
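A minimal sketch of that CPU/GPU split using the tf.data API (the answer above refers to the older input queues; the synthetic data and shapes here are just placeholders):

```python
import tensorflow as tf

# Synthetic "dataset" living in host (CPU) memory, purely for illustration.
images = tf.random.uniform((256, 64, 64, 3))
labels = tf.random.uniform((256,), maxval=10, dtype=tf.int32)

dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .map(lambda x, y: (tf.image.random_flip_left_right(x), y),
                num_parallel_calls=tf.data.AUTOTUNE)   # per-element CPU-side preprocessing
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))                # overlap CPU work with GPU compute

# Each batch yielded here is what gets copied from CPU memory to the GPU.
for batch_images, batch_labels in dataset.take(1):
    print(batch_images.shape)   # (32, 64, 64, 3)
```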
The miscellaneous part is a little bit confusing. I guess the author is talking about data augmentation and the fact that a single input can be augmented in infinitely many ways. You have to take into account that if you're applying transformations to a batch of inputs (e.g., random brightness to a whole batch of images), the data has to be transferred from the CPU to the GPU so the transformations can be computed, and the augmented versions are stored in GPU memory before processing. However, the transfer operation would be done anyway; you just lose some computation time (for the preprocessing, of course), and the allocated memory will be the same.
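To illustrate that last point, a small sketch (the shapes are arbitrary): the augmented batch has the same shape and dtype as the original, so it occupies the same amount of GPU memory; only the extra preprocessing compute is added.

```python
import tensorflow as tf

batch = tf.random.uniform((64, 224, 224, 3))                  # synthetic image batch
augmented = tf.image.random_brightness(batch, max_delta=0.3)  # same shape -> same footprint

print(batch.shape, augmented.shape)   # (64, 224, 224, 3) for both
print(batch.dtype, augmented.dtype)   # float32 for both
```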
Upvotes: 1