user1747036

Reputation: 524

Using CPU vs GPU to train a model - Speed vs memory

I am trying to train the model found at https://github.com/silicon-valley-data-science/RNN-Tutorial with a dataset generated through https://github.com/jupiter126/Create_Speech_Dataset (around 340,000 small WAV audio samples with transcripts).

When I train on the GPU, training goes relatively fast, but I can't set batch_train_size above 25 without hitting OOM.
When I train on the CPU, training is much slower, but I can easily set batch_train_size to 250 (probably up to 700, but I haven't tried yet).

I am confused about how the small batch size limit on the GPU might affect training quality, and whether raising the number of epochs might cancel out that effect...
In other words, batches of 25 samples for 10,000 epochs, or batches of 500 samples for 500 epochs?

The GPU is a GTX 1060 with 6 GB of RAM; the CPU is a dual Xeon 2630L v4 (2×10 hyperthreaded cores at 1.7 GHz) with 128 GB of RAM.

Upvotes: 2

Views: 4160

Answers (3)

TQA

Reputation: 267

This paper ("Don't Decay the Learning Rate, Increase the Batch Size", Smith et al., 2017) researches the relation between batch size and learning rate. Instead of decaying the learning rate, the authors increase the batch size by the same factor.

It reaches equivalent test accuracies after the same number of training epochs, but with fewer parameter updates, leading to greater parallelism and shorter training times.

In short, if you use a bigger batch size, you can use a larger learning rate to reduce training time.
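To make that heuristic concrete, here is a minimal sketch (my own illustration, not code from the paper) of linear scaling: the learning rate grows by the same factor as the batch size. The base values below are hypothetical.

```python
# Sketch of the linear-scaling heuristic described above (illustrative only).
# base_batch_size and base_lr are hypothetical reference values.
base_batch_size = 25   # e.g. the largest batch that fits on the GPU
base_lr = 1e-3         # learning rate tuned for that batch size

def scaled_lr(batch_size):
    """Grow the learning rate by the same factor as the batch size."""
    return base_lr * (batch_size / base_batch_size)

print(scaled_lr(250))  # 0.01: a 10x larger batch gets a 10x larger learning rate
```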

Upvotes: 1

David Parks

Reputation: 32051

I've experimented with batch sizes in a project using a convolutional neural network and found something interesting: batch size acts as a regularizer.

I had a network (convolutional in this case, but the point carries over to your case) and I had both a small and large dataset.

I did a comprehensive hyperparameter search over 20 hyperparameters in the network (days' worth of training), including batch size, L2 regularization, dropout, convolution parameters, neurons in fully connected layers, etc. The search was judged on a held-out validation dataset.

When I had the small dataset (tens of thousands of samples), the hyperparameter search favored more L2 regularization and dropout; those values produced better results on the validation set. It also favored a lower batch size.

When I had the large dataset (millions of samples), the dataset itself was sufficiently large to avoid overfitting. The hyperparameter search favored lower L2 regularization and dropout (in fact it chose a 98% keep probability for dropout). And this time it favored a larger batch size.

That was unexpected; I haven't seen much literature that casts batch size as a regularization parameter, but the results were pretty clear to me in those experiments.

So, to your point directly: it will probably make a small difference, but you can likely compensate with other regularization techniques. You'll get far more mileage from training faster and testing more hyperparameter combinations than from fixating on batch size and sacrificing your ability to run a lot of experiments.
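As a rough sketch of what such a search can look like (illustrative only, not the code used in these experiments), a random search can treat batch size as just another hyperparameter; the ranges and the train_and_validate helper below are hypothetical placeholders.

```python
# Illustrative random search over hyperparameters, including batch size.
import random

search_space = {
    "batch_size": [16, 25, 32, 64, 128],
    "l2": [0.0, 1e-5, 1e-4, 1e-3],
    "dropout_keep_prob": [0.7, 0.8, 0.9, 0.98],
}

def train_and_validate(params):
    """Stand-in for a real training run; replace with training + validation eval."""
    return random.random()  # dummy validation score so the sketch runs end to end

best_params, best_score = None, float("-inf")
for _ in range(20):  # the number of trials is arbitrary here
    params = {k: random.choice(v) for k, v in search_space.items()}
    score = train_and_validate(params)  # judged on a held-out validation set
    if score > best_score:
        best_params, best_score = params, score

print(best_params, best_score)
```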

Upvotes: 2

Autonomous

Reputation: 9075

When you have a large batch size you get a better estimate of the gradient, and vice versa for a small batch size. However, slightly noisy gradients are not always bad: they help the network get out of a (possibly) bad local minimum, or in other words, they give the optimizer a chance to explore other local minima that might be better. As far as I know, there is no fool-proof way of knowing the optimal batch size. A rule of thumb is to consider batch sizes anywhere from 32 to 128, but again, this depends on the application, the number of GPUs you are using, etc.

Regarding speed, my guess is that the GPU is always going to win, even if its batch size is 20 times smaller. You can time it simply by measuring how long it takes to process a certain number of samples (not batches). If you observe that the batch size is hurting your validation accuracy and convergence, then you may think about shifting to CPU training.
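A minimal sketch of such a timing test (illustrative; run_one_batch and the names in the usage comment are hypothetical stand-ins for an actual training step):

```python
# Measure throughput in samples per second so GPU and CPU runs with different
# batch sizes (e.g. 25 vs 250) can be compared fairly.
import time

def samples_per_second(run_one_batch, batch_size, n_batches=50):
    """Time n_batches training steps and return throughput in samples/second."""
    start = time.time()
    for _ in range(n_batches):
        run_one_batch(batch_size)
    return (n_batches * batch_size) / (time.time() - start)

# Hypothetical usage:
#   gpu_rate = samples_per_second(gpu_train_step, batch_size=25)
#   cpu_rate = samples_per_second(cpu_train_step, batch_size=250)
```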

Bottom line: run the tests above, but from the information available to me, I would say go with GPU training.

Upvotes: 1
