liuyl

Reputation: 175

How to run a TensorFlow GPU container on Google Compute Engine?

I am trying to run a TensorFlow container on Google Compute Engine with GPU accelerators.

I tried the command:

gcloud compute instances create-with-container job-name \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-k80 \
  --image-project=deeplearning-platform-release \
  --image-family=common-container \
  --container-image=gcr.io/my-container \
  --container-arg="--container-arguments=xxxx"

But I got this warning:

WARNING: This container deployment mechanism requires a Container-Optimized OS image in order to work. Select an image from a cos-cloud project (cos-stable, cos-beta, cos-dev image families).

I also tried system images from the cos-cloud project, but they don't seem to have the CUDA driver installed, because TensorFlow logs a cuInit failed warning.
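From the COS documentation it looks like the driver has to be installed separately and then mapped into the container by hand, roughly along these lines (a sketch only, assuming a single GPU at /dev/nvidia0 and the cos-extensions installer; gcr.io/my-container is my image from above):

# On the COS instance: install the NVIDIA driver with the documented COS installer
sudo cos-extensions install gpu

# Make the installed driver binaries and libraries executable
sudo mount --bind /var/lib/nvidia /var/lib/nvidia
sudo mount -o remount,exec /var/lib/nvidia

# Run the container with the driver files and GPU devices mapped in
docker run \
  --volume /var/lib/nvidia/lib64:/usr/local/nvidia/lib64 \
  --volume /var/lib/nvidia/bin:/usr/local/nvidia/bin \
  --device /dev/nvidia0:/dev/nvidia0 \
  --device /dev/nvidia-uvm:/dev/nvidia-uvm \
  --device /dev/nvidiactl:/dev/nvidiactl \
  gcr.io/my-container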

What is the correct way to run a TensorFlow container on Google Compute Engine with GPU support?

Upvotes: 5

Views: 1183

Answers (2)

northtree

Reputation: 9255

You could docker run your container from the startup script of a Deep Learning VM:


gcloud beta compute instances create deeplearningvm-$(date +"%Y%m%d-%H%M%S") \
--zone=us-central1-c \
--machine-type=n1-standard-8 \
--subnet=default \
--service-account=<your google service account> \
--scopes='https://www.googleapis.com/auth/cloud-platform' \
--accelerator=type=nvidia-tesla-k80,count=1 \
--image-project=deeplearning-platform-release \
--image-family=tf-latest-gpu \
--maintenance-policy=TERMINATE \
--metadata=install-nvidia-driver=True,startup-script='#!/bin/bash

# Wait until the NVIDIA driver has finished installing
while ! [[ -x "$(command -v nvidia-smi)" ]];
do
  echo "sleep to check"
  sleep 5s
done
echo "nvidia-smi is installed"

# Allow Docker to authenticate with Container Registry before pulling the image
gcloud auth configure-docker
echo "Docker run with GPUs"
docker run --gpus all --log-driver=gcplogs --rm gcr.io/<your container>

echo "Kill VM $(hostname)"
gcloud compute instances delete $(hostname) --zone \
$(curl -H Metadata-Flavor:Google http://metadata.google.internal/computeMetadata/v1/instance/zone -s | cut -d/ -f4) -q

'

Since it takes several minutes to install the NVIDIA driver, you have to wait until the installation finishes before starting your container. See https://cloud.google.com/ai-platform/deep-learning-vm/docs/tensorflow_start_instance#creating_a_tensorflow_instance_from_the_command_line

Compute Engine loads the latest stable driver on the first boot and performs the necessary steps (including a final reboot to activate the driver). It may take up to 5 minutes before your VM is fully provisioned. In this time, you will be unable to SSH into your machine. When the installation is complete, to guarantee that the driver installation was successful, you can SSH in and run nvidia-smi.
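Once the instance is reachable, a quick way to confirm the driver from your workstation (the instance name and zone below are placeholders):

# SSH in and run nvidia-smi to verify that the driver is active
gcloud compute ssh deeplearningvm-instance --zone=us-central1-c --command="nvidia-smi"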

Upvotes: 2

Ernesto U

Reputation: 806

Have you considered Cloud TPU on GKE?

This page describes how to set up a GKE cluster with GPUs.
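A rough sketch of that setup (the cluster name and accelerator type are assumptions; the driver-installer DaemonSet is the one referenced in the GKE GPU docs):

# Create a GKE cluster whose nodes each have one K80 GPU
gcloud container clusters create gpu-cluster \
  --zone=us-central1-c \
  --num-nodes=1 \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-k80,count=1

# Install the NVIDIA driver on the nodes with Google's driver-installer DaemonSet
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# Pods then request a GPU by setting the nvidia.com/gpu resource limit in their spec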

Upvotes: 0
