Reputation: 10099
I'm trying to train a network, but I get the error below. I set my batch size to 300 and got this error, and even if I reduce it to 100 I still get it. More frustratingly, running 10 epochs on ~1200 images takes about 40 minutes. Any suggestions as to what is going wrong and how I can speed up the process? Any tips would be extremely helpful. Thanks in advance.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-31-3b43ff4eea72> in <module>()
5 labels = Variable(labels).cuda()
6
----> 7 optimizer.zero_grad()
8 outputs = cnn(images)
9 loss = criterion(outputs, labels)
/usr/local/lib/python3.5/dist-packages/torch/optim/optimizer.py in zero_grad(self)
114 if p.grad is not None:
115 if p.grad.volatile:
--> 116 p.grad.data.zero_()
117 else:
118 data = p.grad.data
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/torch/lib/THC/generic/THCTensorMath.cu:35
Even though my GPUs are free:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 108... Off | 00000000:05:00.0 Off | N/A |
| 23% 18C P8 15W / 250W | 10864MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 108... Off | 00000000:08:00.0 Off | N/A |
| 23% 20C P8 15W / 250W | 10MiB / 11172MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
Upvotes: 5
Views: 8213
Reputation: 46291
This is a fairly general question. Here is how I would approach the problem.
Try setting the batch size (the number of samples per batch) to 1. If that fixes the problem, you can then search for the largest batch size that still fits in memory.
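The search for a workable batch size can be automated. The sketch below is only an illustration: `find_max_batch_size`, `try_step` and `fake_step` are hypothetical names, and it assumes the out-of-memory condition surfaces as a `RuntimeError` whose message contains "out of memory", as in the traceback in the question. It starts from the original batch size and halves it, rather than starting from 1 and growing.

```python
def find_max_batch_size(try_step, start=300, minimum=1):
    """Halve the batch size until try_step(batch_size) no longer runs out of memory.

    try_step is a hypothetical callable that runs one forward/backward pass
    with the given batch size and raises RuntimeError on failure.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            try_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # Only swallow out-of-memory errors; re-raise anything else.
            if "out of memory" not in str(err):
                raise
            batch_size //= 2
    return None  # even the minimum batch size did not fit


# Stand-in for a real training step (assumption for illustration only):
# it pretends that anything above 64 samples does not fit on the GPU.
def fake_step(batch_size):
    if batch_size > 64:
        raise RuntimeError("cuda runtime error (2) : out of memory")


print(find_max_batch_size(fake_step))  # 300 -> 150 -> 75 -> 37, prints 37
```

In a real run, `try_step` would build a batch of that size, move it to the GPU and call the model, the loss and `backward()`. If the function returns `None`, the batch size is not the cause.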
If even with bs=1 you get "RuntimeError: cuda runtime error (2) : out of memory":
Do not use linear layers that are too large. A linear layer nn.Linear(m, n) uses O(nm) memory: that is to say, the memory requirements of the weights scale quadratically with the number of features, and the gradients take as much again.
Do not accumulate history across your training loop. If you sum the loss tensor itself inside a loop that runs 10000 or more iterations, the graph kept for back-propagation grows with every iteration and takes a lot of memory. Accumulate the loss as a plain Python number instead.
Delete tensors you no longer need explicitly with del.
If you suspect that some other Python process is eating your GPU memory, run ps -elf | grep python to list the Python processes and kill the offending one with kill -9 [pid].
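To make the linear-layer point above concrete, here is a rough back-of-the-envelope estimate. It is a sketch under stated assumptions: `linear_layer_bytes` is a hypothetical helper, values are assumed to be 32-bit floats, and the layer sizes are made up for illustration. It counts only the weights, the bias and their gradients, not activations or optimizer state.

```python
def linear_layer_bytes(in_features, out_features, bytes_per_value=4, with_grad=True):
    """Approximate parameter memory of nn.Linear(in_features, out_features)."""
    # Weight matrix (out_features x in_features) plus one bias per output.
    params = in_features * out_features + out_features
    # Gradients have the same shape as the parameters, so they double the cost.
    copies = 2 if with_grad else 1
    return params * bytes_per_value * copies


# Hypothetical sizes: a flattened 512 x 7 x 7 feature map feeding 4096 units.
mib = linear_layer_bytes(25088, 4096) / 2**20
print("{:.1f} MiB".format(mib))  # prints 784.0 MiB
```

A single layer of that size already takes a noticeable share of an 11 GiB card before any activations are stored, which is why shrinking the input to the first linear layer is often the cheapest fix.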
Upvotes: 3