Python3 Pytorch RuntimeError on GCP - no msg

My System

I am training a neural network using Python 3.6.9 with PyTorch 1.6.0.
I am using a Google Cloud Platform N1 server with a Tesla T4, 2 CPU cores, and 12GB RAM. This is on an Ubuntu 18.04 image.

Problem

When my code reaches the training line I get the following RuntimeError with no real explanation that I can see:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/or/.local/share/virtualenvs/or-M3_AaJfY/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/or/my_model/train.py", line 88, in train_and_eval
    train(rank, epoch, hps, generator, optimizer_g, train_loader, logger, writer)
  File "/home/or/my_model/train.py", line 117, in train
    scaled_loss.backward()
  File "/home/or/.local/share/virtualenvs/or-M3_AaJfY/lib/python3.6/site-packages/torch/tensor.py", line 185, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/or/.local/share/virtualenvs/or-M3_AaJfY/lib/python3.6/site-packages/torch/autograd/__init__.py", line 127, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError

This happens after the 2 CPU cores have been at 100% usage for a long while.
RAM and GPU usage go up (as expected while training) but do not come close to their limits.
I checked journalctl to see if this was an operating system issue, but there is nothing there. I also did not find anything relevant in the /var/log/ directory or using dmesg.
I would be happy to provide more log data, but after searching I am not aware of any Python logs or other system logs I can look at.

Please let me know of ways I can get more information if you have any ideas.

The exact same code works fine on the other physical machines I have tested, and a GPU-only version of it runs fine on another cloud computing provider.

What I am looking for

Ways to get more information about this problem and figure out why it is happening.
Ways to fix this problem.

Thanks in advance for your time and any help you may be able to provide.

Upvotes: 2

Answers (2)

Oha Noch

Reputation: 404

Anthony Leo, thank you so much for your detailed answer! Unfortunately, this ended up being a problem with one of the modules I installed while setting up my server.
It was not a problem with the server itself or with my code; I just installed a module incorrectly during setup.

I am sorry for all the time other people spent on this issue.

Upvotes: 1

Anthony Leo

Reputation: 498

To get more information about the problem and figure out why it is happening, you can break your troubleshooting down into two layers:

Application Layer
GCE VM Instance Layer

For the most part, we will focus on the GCE VM Instance Layer, as there could be more information to find there: these logs will show whether the GCE instance was running into issues before or after the stack trace you posted above.
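On the Application Layer side, the traceback in the question ends in a bare RuntimeError with no message, so it can help to log everything else the exception object carries. Below is a minimal sketch using only the Python standard library. The names `describe_exception`, `run_logged`, `enable_crash_dumps` and the logger name `train_debug` are made up for illustration and are not part of PyTorch or of the original code.

```python
import faulthandler
import logging
import traceback

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("train_debug")  # hypothetical logger name


def enable_crash_dumps():
    """Ask Python to dump tracebacks on fatal signals such as SIGSEGV.

    Returns False instead of raising if stderr has no usable file descriptor.
    """
    try:
        faulthandler.enable(all_threads=True)
        return True
    except Exception:
        return False


def describe_exception(exc):
    """Describe an exception in full, even when str(exc) is empty."""
    lines = [
        "type: {}.{}".format(type(exc).__module__, type(exc).__name__),
        "repr: {!r}".format(exc),
        "args: {!r}".format(exc.args),
        "cause: {!r}".format(exc.__cause__),
        "context: {!r}".format(exc.__context__),
        "traceback:",
        "".join(traceback.format_exception(type(exc), exc, exc.__traceback__)),
    ]
    return "\n".join(lines)


def run_logged(fn, *args, **kwargs):
    """Call fn, log full details of any exception, then re-raise it."""
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        logger.error("Call failed\n%s", describe_exception(exc))
        raise
```

In the question's code this would mean calling `enable_crash_dumps()` once at startup and wrapping the failing call, for example `run_logged(scaled_loss.backward)`. The `cause` and `context` fields are worth checking because a chained exception sometimes holds the message that the outer one lacks.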

Depending on your VM instance configuration, I would suggest installing the Cloud Logging agent on the affected VM so that we can gather logs from inside the VM. This is also helpful because the logs gathered this way are accurate.

Once the agent is installed and running on the VM, we can go to the Logs Explorer console on GCP, which lets us view logs from both of the layers mentioned above. Keep in mind that by this step you should have re-run your application and its scenarios.

From here on, we can view all logs in Logs Explorer and sort or filter them by timestamp, resource type, and so on using log queries. This is a great place to start, as it lets you view all logs in chronological order leading up to the error. This should allow you to find out why the issue is happening and/or give clues on how to fix it.
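To narrow the view to the minutes around the crash, a small helper can build a filter string to paste into the Logs Explorer query box. This is only a sketch: the function name `build_log_filter`, the window sizes, and the timestamp and instance ID in the usage example are placeholders, and you would substitute the time of your own stack trace.

```python
from datetime import datetime, timedelta, timezone


def _rfc3339(moment):
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_log_filter(crash_time, minutes_before=10, minutes_after=2,
                     instance_id=None):
    """Build a Logs Explorer filter for GCE logs around crash_time."""
    start = crash_time - timedelta(minutes=minutes_before)
    end = crash_time + timedelta(minutes=minutes_after)
    clauses = ['resource.type="gce_instance"']
    if instance_id is not None:
        # Restrict the results to a single VM when its ID is known.
        clauses.append('resource.labels.instance_id="{}"'.format(instance_id))
    clauses.append('timestamp>="{}"'.format(_rfc3339(start)))
    clauses.append('timestamp<="{}"'.format(_rfc3339(end)))
    return " AND ".join(clauses)


if __name__ == "__main__":
    # Placeholder crash time and instance ID, for illustration only.
    crash = datetime(2020, 10, 1, 12, 0, 0, tzinfo=timezone.utc)
    print(build_log_filter(crash, instance_id="1234567890"))
```

Adding a clause such as `severity>=WARNING` to the result cuts the noise further if the window still returns too many entries.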

Upvotes: 0
