Reputation: 1291
My CUDA program crashed during execution, before memory was flushed. As a result, device memory remained occupied.
I'm running on a GTX 580, for which nvidia-smi --gpu-reset is not supported.
Placing cudaDeviceReset() at the beginning of the program only affects the context created by the current process and doesn't flush the memory allocated before it.
I'm accessing a Fedora server with that GPU remotely, so a physical reset is quite complicated.
So, the question is: is there any way to flush the device memory in this situation?
Upvotes: 126
Views: 376662
Reputation: 762
Expanding on the torch.cuda.empty_cache() solution elsewhere in this thread, you can get further detail on the memory being cleared and print the outcome:
import torch
import gc

def print_gpu_memory():
    allocated = torch.cuda.memory_allocated() / (1024**2)
    cached = torch.cuda.memory_reserved() / (1024**2)
    print(f"Allocated: {allocated:.2f} MB")
    print(f"Cached: {cached:.2f} MB")

# Before clearing the cache
print("Before clearing cache:")
print_gpu_memory()

# Clearing the cache
gc.collect()
torch.cuda.empty_cache()

# After clearing the cache
print("\nAfter clearing cache:")
print_gpu_memory()
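A useful detail here: torch.cuda.empty_cache() can only return cached blocks that no live tensor references, which is why gc.collect() runs first to drop any lingering Python references.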
Upvotes: 1
Reputation: 358
Normally I just use nvidia-smi to find and kill the offending process, but for some problems that's not enough (something is still holding CUDA memory).
The nvidia-smi kill-all one-liner is:
nvidia-smi | grep 'python' | awk '{ print $5 }' | xargs -n1 kill -9
If you're still hitting unexpected memory errors or similar problems, try:
sudo fuser -v /dev/nvidia* | cut -d' ' -f2- | sudo xargs -n1 kill -9
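If you would rather do the same thing from Python, here is a minimal sketch using nvidia-smi's machine-readable query flags (--query-compute-apps and --format are standard nvidia-smi options; the 'python' filter is just an example, matching the grep above):
import os
import signal
import subprocess

# Ask nvidia-smi for every compute process as "pid, process_name" CSV rows.
out = subprocess.check_output(
    ["nvidia-smi", "--query-compute-apps=pid,process_name",
     "--format=csv,noheader"],
    text=True,
)
for row in out.strip().splitlines():
    pid, name = (field.strip() for field in row.split(",", 1))
    if "python" in name:  # same filter as the grep one-liner
        os.kill(int(pid), signal.SIGKILL)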
Upvotes: 7
Reputation: 36
If none of this works, I found another answer here:
How to kill process on GPUs with PID in nvidia-smi using keyword?
nvidia-smi | grep 'python' | awk '{ print $X }' | xargs -n1 kill -9
Note that X (in the awk expression) corresponds to the column of the nvidia-smi output that contains the PID; the exact column depends on your driver version's layout. In the layout shown in the linked answer, the PID is the fifth field, so you would replace X by 5.
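Alternatively, nvidia-smi --query-compute-apps=pid --format=csv,noheader prints just the PIDs, which avoids counting columns altogether.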
Upvotes: 0
Reputation: 71
If you have the problem that after killing one process the next one starts (see the linked comment), for example when a bash script calls multiple Python scripts in sequence and you can't find the script's PID, use ps -ef
to find the PID of your "problematic" process and also its PPID (parent PID). Then use kill PPID, kill -9 PPID, or sudo kill PPID to stop the parent and, with it, the whole chain.
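As an illustration, here is a small Python sketch of the same lookup, assuming a standard ps; parent_pid_of is a hypothetical helper name, not part of any library:
import os
import signal
import subprocess

def parent_pid_of(name):
    # List all processes as "PID PPID COMMAND" and return the parent PID
    # of the first one whose command name contains `name`.
    out = subprocess.check_output(["ps", "-eo", "pid,ppid,comm"], text=True)
    for line in out.splitlines()[1:]:  # skip the header row
        pid, ppid, comm = line.split(None, 2)
        if name in comm:
            return int(ppid)
    return None

ppid = parent_pid_of("python")
if ppid is not None:
    os.kill(ppid, signal.SIGKILL)  # the equivalent of kill -9 PPID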
Upvotes: 0
Reputation: 1
I just started a new terminal and closed the old one and it worked out pretty well for me.
Upvotes: -2
Reputation: 35
For Ubuntu 20.04: in the terminal, run
nvtop
If killing the consuming process directly from nvtop doesn't work, find and note the PID of the process with the most GPU usage, then run
sudo kill -9 PID
Upvotes: 1
Reputation: 394
One can also use nvtop, which gives an interface very similar to htop, but showing your GPU(s) usage instead, with a nice graph. You can also kill processes directly from there.
Here is a link to its GitHub: https://github.com/Syllo/nvtop
Upvotes: 16
Reputation: 72349
Although it should be unnecessary to do this in anything other than exceptional circumstances, the recommended way to do this on Linux hosts is to unload the nvidia driver by doing
$ rmmod nvidia
with suitable root privileges and then reloading it with
$ modprobe nvidia
If the machine is running X11, you will need to stop it manually beforehand and restart it afterwards. The driver initialisation process should eliminate any prior state on the device.
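Note that rmmod will refuse with an error like "Module nvidia is in use" while any process still has /dev/nvidia* open, so kill those processes first (the fuser answers in this thread show how to find them).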
This answer has been assembled from comments and posted as a community wiki to get this question off the unanswered list for the CUDA tag.
Upvotes: 19
Reputation: 427
For those using Python:
import torch, gc

gc.collect()              # collect unreachable Python objects that may still hold GPU tensors
torch.cuda.empty_cache()  # release the allocator's unused cached memory back to the GPU
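Keep in mind that this only frees memory cached by the current process's PyTorch allocator; it cannot reclaim memory held by live tensors or by other processes.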
Upvotes: 21
Reputation: 2281
First run
nvidia-smi
then find the PID of the process you want to kill and run
sudo kill -9 PID
Upvotes: 77
Reputation: 14094
Check what is using your GPU memory with
sudo fuser -v /dev/nvidia*
Your output will look something like this:
                     USER        PID  ACCESS COMMAND
/dev/nvidia0:        root       1256  F...m  Xorg
                     username   2057  F...m  compiz
                     username   2759  F...m  chrome
                     username   2777  F...m  chrome
                     username  20450  F...m  python
                     username  20699  F...m  python
Then kill the PIDs that you no longer need, either in htop or with
sudo kill -9 PID
In the example above, PyCharm was eating a lot of memory, so I killed 20450 and 20699.
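To kill everything that still has the device open in one step, fuser also has a kill option, sudo fuser -k /dev/nvidia*, which sends SIGKILL by default.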
Upvotes: 215
Reputation: 137
I also had the same problem, and I saw a good solution on Quora, using
sudo kill -9 PID
See https://www.quora.com/How-do-I-kill-all-the-computer-processes-shown-in-nvidia-smi
Upvotes: 12
Reputation: 16394
On macOS (/ OS X), if someone else is having trouble with the OS apparently leaking memory:
Upvotes: 5