Reputation: 43
During the last week I have been trying to create a Python experiment in Azure ML Studio. The job consists of training a PyTorch (1.12.1) neural network using a custom environment with CUDA 11.6 for GPU acceleration. However, whenever I attempt any operation that moves a tensor to the GPU, I get a RuntimeError:
import torch

device = torch.device("cuda")
test_tensor = torch.rand((3, 4), device="cpu")  # tensor created on the CPU
test_tensor.to(device)  # moving it to the GPU raises the error below
CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
I have tried setting CUDA_LAUNCH_BLOCKING=1, but this does not change the result.
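For completeness, this is a minimal sketch of how I set the variable in code; since it is read when the CUDA context is initialized, it has to be in place before the first CUDA call (i.e. before any tensor is moved to the GPU):

import os

# Must be set before CUDA is initialized, i.e. before the first .to("cuda")
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"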
I have also tried to check if CUDA is available:
print(f"Is cuda available? {torch.cuda.is_available()}")
print(f"Which is the current device? {torch.cuda.current_device()}")
print(f"How many devices do we have? {torch.cuda.device_count()}")
print(f"How is the current device named? {torch.cuda.get_device_name(torch.cuda.current_device())}")
and the result is completely normal:
Is cuda available? True
Which is the current device? 0
How many devices do we have? 1
How is the current device named? Tesla K80
I also tried downgrading and changing the CUDA, PyTorch, and Python versions, but this does not seem to affect the error.
As far as I can tell, this error appears only when using a custom environment; when a curated environment is used, the script runs with no problem. However, since the script needs some libraries such as OpenCV, I am forced to use a custom Dockerfile to create my environment, which you can read here for reference:
FROM mcr.microsoft.com/azureml/aifx/stable-ubuntu2004-cu116-py39-torch1121:biweekly.202301.1
USER root
# System dependencies required by OpenCV
RUN apt update && \
    apt install -y ffmpeg libsm6 libxext6 libgl1-mesa-glx
RUN pip install numpy matplotlib pandas opencv-python Pillow scipy tqdm mlflow joblib onnx ultralytics
RUN pip install 'ipykernel~=6.0' \
'azureml-core' \
'azureml-dataset-runtime' \
'azureml-defaults' \
'azure-ml' \
'azure-ml-component' \
'azureml-mlflow' \
'azureml-telemetry' \
'azureml-contrib-services'
COPY --from=mcr.microsoft.com/azureml/o16n-base/python-assets:20220607.v1 /artifacts /var/
RUN /var/requirements/install_system_requirements.sh && \
cp /var/configuration/rsyslog.conf /etc/rsyslog.conf && \
cp /var/configuration/nginx.conf /etc/nginx/sites-available/app && \
ln -sf /etc/nginx/sites-available/app /etc/nginx/sites-enabled/app && \
rm -f /etc/nginx/sites-enabled/default
ENV SVDIR=/var/runit
ENV WORKER_TIMEOUT=400
EXPOSE 5001 8883 8888
The code from the COPY statement onwards is copied from one of the curated environments predefined by Azure. I would like to highlight that I also tried using the Dockerfile from one of these curated environments without any modification, and I get the same result.
Hence, my question is: How can I run a CUDA job using a custom environment? Is it possible?
I have tried to find a solution to this, but I have not been able to find anyone with the same problem, nor any place in the Microsoft documentation where I could ask about it. I hope this is not a duplicate and that some of you can help me out here.
Upvotes: 3
Views: 2297
Reputation: 14983
The problem is indeed subtle and hard to debug. I suspect it has to do with the underlying hardware on which the Docker container is deployed, not with the custom Docker container itself or its dependencies.
Since you have a Tesla K80, I suspect you are running on NC-series machines (the hardware the environments are deployed on).
As of writing this answer (10 February 2023), the following note applies (https://learn.microsoft.com/en-us/azure/machine-learning/resource-curated-environments):
Note
Currently, due to underlying cuda and cluster incompatibilities, on NC series only AzureML-ACPT-pytorch-1.11-py38-cuda11.3-gpu with cuda 11.3 can be used.
Therefore, in my opinion, this can be traced back to the supported combinations of CUDA, PyTorch, and Python versions.
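As a quick sanity check (a minimal sketch using standard PyTorch calls), you can print, inside the job, which CUDA version the installed wheel was built against and which GPU architectures it actually ships kernels for; the Tesla K80 is a Kepler card with compute capability 3.7 (sm_37), so if sm_37 is missing from the list, that build cannot target your GPU:

import torch

# CUDA version the installed PyTorch wheel was compiled against
print(f"Built against CUDA: {torch.version.cuda}")
# A Tesla K80 reports compute capability (3, 7)
print(f"Compute capability: {torch.cuda.get_device_capability(0)}")
# GPU architectures this PyTorch build ships kernels for, e.g. ['sm_37', ...]
print(f"Supported archs: {torch.cuda.get_arch_list()}")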
In my case, I simply installed my dependencies via a .yaml dependency file when creating the environment, starting from this base image:
Azure container registry
mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9
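For illustration, here is a minimal sketch (using the v1 azureml-core SDK; the environment name and package list are placeholders, adjust them to your needs) of how an environment can be built from this base image with extra pip dependencies, instead of a fully custom Dockerfile:

from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Hypothetical environment name; pick one matching your workspace conventions
env = Environment(name="k80-pytorch-cu113")
env.docker.base_image = (
    "mcr.microsoft.com/azureml/curated/acpt-pytorch-1.11-py38-cuda11.3-gpu:9"
)

# Extra pip packages layered on top of the curated base image
conda = CondaDependencies()
for pkg in ["opencv-python", "numpy", "pandas", "tqdm"]:
    conda.add_pip_package(pkg)
env.python.conda_dependencies = conda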
You can build your own Docker container from this URI as the base image so that it works properly on Tesla K80s.
IMPORTANT NOTE: Using this base image did work in my case; I was able to train PyTorch models.
Upvotes: 5