Here is the output of nvidia-smi while the GPU-intensive code is running:
$ nvidia-smi
Mon Feb 13 10:20:42 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:81:00.0 Off |                  Off |
|  0%   47C    P8    28W / 450W |      8MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:C1:00.0 Off |                  Off |
|  0%   36C    P8    29W / 450W |      8MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1947      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1947      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
It shows 0% utilization for both GPUs, and the script does not show up in the Processes list, which means the GPUs are idle.
The check for the device being used is:
import torch
# show PyTorch version
print(torch.__version__)
# Check if CUDA is available
print('Is CUDA available?', torch.cuda.is_available())
And the output of the above is:
1.13.1+cu117
Is CUDA available? True
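A few additional checks using standard torch.cuda calls can confirm what PyTorch sees (the example outputs in the comments are illustrative and depend on the hardware):

import torch

print(torch.cuda.device_count())      # number of GPUs PyTorch can see, e.g. 2
print(torch.cuda.current_device())    # index of the current default GPU, e.g. 0
print(torch.cuda.get_device_name(0))  # name of GPU 0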
The GPU code in question is available here. Sorry for not pasting the whole chunk of code, as it is too long.
UPDATE: The code in question is as follows:
class Net(nn.Module):
    device = torch.device("cuda")  # I added this

    def __init__(self, n_vocab, embedding_dim, hidden_dim, dropout=0.2):
        super(Net, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim  # dim = dimension
        embedding_dim.to(device)  # I added this
        self.embeddings = nn.Embedding(n_vocab, embedding_dim)
        # LSTM Layer (input_size, hidden_size)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, dropout=dropout)
        # Fully connected layer, change "Hidden State" Linear to output
        self.hidden2out = nn.Linear(hidden_dim, n_vocab)

    def forward(self, seq_in):
        seq_in.to(device)  # I added this
        embeddings = self.embeddings(seq_in.t())
        lstm_out, _ = self.lstm(embeddings)
        ht = lstm_out[-1]
        out = self.hidden2out(ht)
        return out
The RuntimeError occurs at the line embeddings = self.embeddings(seq_in.t()). The full RuntimeError is as follows:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper__index_select)
How can I modify the Net class in order to make it work again?
General Information: In PyTorch, each tensor can be on one of several "devices" (e.g. cpu or cuda:0). Most operations performed on a tensor are executed using the compute capabilities of the associated device. If an operation takes multiple inputs, in most cases all tensor inputs have to be on the same device.
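As a small illustration of this (assuming a CUDA-capable GPU is available):

import torch

a = torch.ones(3)                 # created on the CPU by default
b = torch.ones(3, device='cuda')  # created on the first GPU
print(a.device, b.device)         # prints: cpu cuda:0

# a + b would raise the same "Expected all tensors to be on the same device" RuntimeError
c = a.to('cuda') + b              # works: both operands are now on cuda:0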
So, for your model to use a GPU, not only do you need to have a GPU available, but all data also has to be explicitly moved to the GPU.
If you use a framework such as PyTorch Lightning, this might be partially done for you automatically.
Otherwise, the basic recipe is:
model = Network()                # create an instance of your model
model = model.to(device='cuda')  # move the model parameters to the GPU

for batch in dataloader:
    x, y, *_ = batch             # unpack data and labels of the batch
    x = x.to(device='cuda')      # move the data to the GPU
    y = y.to(device='cuda')      # move the labels to the GPU
    prediction = model(x)        # apply the model
    loss = lossfunction(prediction, y)
    ...
So, one reason why code might not be using the GPU, without producing any error message, is that the data is never moved to the GPU!
You can check where the tensor x resides by printing x.device.
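The same check works for a model, since its parameters are tensors too (assuming model and x from the recipe above):

print(x.device)                         # e.g. cuda:0 after the .to(...) call
print(next(model.parameters()).device)  # device on which the model's parameters live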
Edit: As a rule of thumb, I would advise against moving tensors around inside the forward or __init__ function of your Network if you can avoid it; instead, move the model to the device once after constructing it, and move each input batch to the same device before calling the model, as in the recipe above.
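Applied to the Net class from the question, a minimal sketch of this approach (the .to(...) calls inside the class are removed; n_vocab, embedding_dim, hidden_dim, and seq_in are assumed to be defined as in the question):

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, n_vocab, embedding_dim, hidden_dim, dropout=0.2):
        super().__init__()
        self.embeddings = nn.Embedding(n_vocab, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, dropout=dropout)
        self.hidden2out = nn.Linear(hidden_dim, n_vocab)

    def forward(self, seq_in):
        # no .to(...) here: seq_in is expected to already be on the right device
        embeddings = self.embeddings(seq_in.t())
        lstm_out, _ = self.lstm(embeddings)
        return self.hidden2out(lstm_out[-1])

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net(n_vocab, embedding_dim, hidden_dim).to(device)  # move all parameters once
out = model(seq_in.to(device))                              # move the input to where the model is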