urig

Reputation: 16831

GCP Vertex AI Training: Auto-packaged Custom Training Job Yields Huge Docker Image

I am trying to run a Custom Training Job in Google Cloud Platform's Vertex AI Training service.

The job is based on a tutorial from Google that fine-tunes a pre-trained BERT model (from HuggingFace).

When I use the gcloud CLI tool to auto-package my training code into a Docker image and deploy it to the Vertex AI Training service like so:

$BASE_GPU_IMAGE="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest"
$BUCKET_NAME = "my-bucket"

gcloud ai custom-jobs create `
--region=us-central1 `
--display-name=fine_tune_bert `
--args="--job_dir=$BUCKET_NAME,--num-epochs=2,--model-name=finetuned-bert-classifier" `
--worker-pool-spec="machine-type=n1-standard-4,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,executor-image-uri=$BASE_GPU_IMAGE,local-package-path=.,python-module=trainer.task"

... I end up with a Docker image that is roughly 18GB (!) and takes a very long time to upload to the GCP registry.

Granted, the base image is around 6.5 GB, but where do the additional >10 GB come from, and is there a way for me to avoid this "image bloat"?

Please note that my job loads the training data at run time using the datasets Python package; AFAIK the data is not included in the auto-packaged Docker image.
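For context, the data-loading part of the trainer looks roughly like the sketch below (the dataset name is an assumption based on the BERT fine-tuning tutorial); the raw data is downloaded inside the running container, not baked into the image:

from datasets import load_dataset

# Assumed sketch of the trainer's data loading: the dataset is downloaded
# at run time inside the training container, so it is never part of the
# auto-packaged Docker image.
dataset = load_dataset("imdb")
print(dataset["train"][0])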

Upvotes: 1

Views: 1102

Answers (1)

Kabilan Mohanraj

Reputation: 1906

The image size shown in the UI is the virtual size of the image, i.e. the compressed total size that is downloaded over the network. Once the image is pulled, it is extracted, and the resulting size is larger. In this case, the PyTorch base image's virtual size is 6.8 GB, while its extracted size is 17.9 GB.

Also, when a docker push command is executed, the progress bars show the uncompressed size. The data is compressed before it is sent, so the progress bars do not reflect the amount actually uploaded.
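If you want to compare both numbers yourself, one rough way to do it is sketched below (this assumes Docker is installed locally; docker manifest inspect may require the experimental CLI on older Docker versions):

# Uncompressed size of the local image -- this is what the push progress bars reflect
docker image ls us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest

# Per-layer sizes as stored in the registry (compressed) -- this is what travels over the network
docker manifest inspect us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest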

To cut down the size of the Docker image, custom containers can be used. With a custom container you install only the components your training code actually needs, which results in a smaller image; see the sketch below. More information on custom containers is available in the Vertex AI documentation.
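A minimal sketch of that approach, assuming you build a slimmer image yourself and push it to Artifact Registry first (PROJECT_ID, my-repo and bert-trainer are placeholder names):

# Build and push your own, smaller training image
docker build -t us-central1-docker.pkg.dev/PROJECT_ID/my-repo/bert-trainer:latest .
docker push us-central1-docker.pkg.dev/PROJECT_ID/my-repo/bert-trainer:latest

# Submit the job with container-image-uri instead of auto-packaging
gcloud ai custom-jobs create `
  --region=us-central1 `
  --display-name=fine_tune_bert `
  --worker-pool-spec="machine-type=n1-standard-4,replica-count=1,accelerator-type=NVIDIA_TESLA_V100,container-image-uri=us-central1-docker.pkg.dev/PROJECT_ID/my-repo/bert-trainer:latest"

The --args flag from the original command can be kept as-is; the difference is that container-image-uri replaces executor-image-uri, local-package-path and python-module, so no auto-packaging takes place.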

Upvotes: 2
