Reputation: 175
I am trying to run a TensorFlow container on Google Compute Engine with GPU accelerators.
I tried the command
gcloud compute instances create-with-container job-name \
--machine-type=n1-standard-4 \
--accelerator=type=nvidia-tesla-k80 \
--image-project=deeplearning-platform-release \
--image-family=common-container \
--container-image=gcr.io/my-container \
--container-arg="--container-arguments=xxxx"
But got the warning
WARNING: This container deployment mechanism requires a Container-Optimized OS image in order to work. Select an image from a cos-cloud project (cos-stable, cos-beta, cos-dev image families).
I also tried system images from the cos-cloud project, but they don't seem to include the CUDA driver: TensorFlow logs a cuInit failed warning.
What's the correct way to run a TensorFlow container on Google Compute Engine with GPU support?
Upvotes: 5
Views: 1183
Reputation: 9255
You could run your container with docker run inside the startup-script of a Deep Learning VM.
gcloud beta compute instances create deeplearningvm-$(date +"%Y%m%d-%H%M%S") \
--zone=us-central1-c \
--machine-type=n1-standard-8 \
--subnet=default \
--service-account=<your google service account> \
--scopes='https://www.googleapis.com/auth/cloud-platform' \
--accelerator=type=nvidia-tesla-k80,count=1 \
--image-project=deeplearning-platform-release \
--image-family=tf-latest-gpu \
--maintenance-policy=TERMINATE \
--metadata=install-nvidia-driver=True,startup-script='#!/bin/bash
# Check the driver until installed
while ! [[ -x "$(command -v nvidia-smi)" ]]; do
  echo "sleep to check"
  sleep 5s
done
echo "nvidia-smi is installed"
gcloud auth configure-docker
echo "Docker run with GPUs"
docker run --gpus all --log-driver=gcplogs --rm gcr.io/<your container>
echo "Kill VM $(hostname)"
gcloud compute instances delete $(hostname) --zone \
$(curl -H Metadata-Flavor:Google http://metadata.google.internal/computeMetadata/v1/instance/zone -s | cut -d/ -f4) -q
'
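The self-delete step at the end of the script parses the zone out of the metadata server's response, which comes back as a resource path of the form projects/&lt;project-number&gt;/zones/&lt;zone&gt;. A minimal illustration of that cut parsing (the sample path below is made up):

```shell
# The metadata server returns the zone as a full resource path:
zone_path="projects/123456789/zones/us-central1-c"

# cut -d/ -f4 takes the fourth /-separated field, i.e. the bare zone name:
zone=$(echo "$zone_path" | cut -d/ -f4)
echo "$zone"  # prints us-central1-c
```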
Since it takes several minutes to install the NVIDIA driver, you have to wait until it is installed before starting your container. https://cloud.google.com/ai-platform/deep-learning-vm/docs/tensorflow_start_instance#creating_a_tensorflow_instance_from_the_command_line
Compute Engine loads the latest stable driver on the first boot and performs the necessary steps (including a final reboot to activate the driver). It may take up to 5 minutes before your VM is fully provisioned. In this time, you will be unable to SSH into your machine. When the installation is complete, to guarantee that the driver installation was successful, you can SSH in and run nvidia-smi.
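One caveat: the wait loop in the script above spins forever if the driver installation fails. A bounded variant could look like this (just a sketch; wait_for_cmd is a hypothetical helper, not something shipped with the image):

```shell
# Sketch: poll for a command with a retry limit instead of looping forever.
# Usage: wait_for_cmd <command> <max_tries> <sleep_seconds>
wait_for_cmd() {
  local cmd=$1 tries=$2 interval=$3
  local i=0
  while [ "$i" -lt "$tries" ]; do
    if command -v "$cmd" >/dev/null 2>&1; then
      return 0  # command is available
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  return 1  # gave up after tries * interval seconds
}

# Example: wait up to 10 minutes for nvidia-smi, then bail out:
# if ! wait_for_cmd nvidia-smi 120 5; then
#   echo "driver never came up" >&2
#   exit 1
# fi
```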
Upvotes: 2
Reputation: 806
Have you considered Cloud TPU on GKE?
This page describes how to set up a GKE cluster with GPUs.
Upvotes: 0