Amine Jallouli

Reputation: 3949

Evaluation not done / export not provided when training a TF model on AI Platform

I have a TF DNNRegressor that I train either locally or in the cloud on GCP (AI Platform).

If the training is done locally (through the gcloud command below), the output contains checkpoints, an eval folder, and an export folder.

gcloud ai-platform local train   \
  --module-name=babyweight.task   \
  --package-path=/home/jupyter/end-to-end-ml/examples/e2e-ml-model-ex02/app/babyweight   \
  --   \
  --nnsize 64 32   \
  --train-data-path=/home/jupyter/end-to-end-ml/examples/e2e-ml-model-ex02/preproc/train.csv*   \
  --eval-data-path=/home/jupyter/end-to-end-ml/examples/e2e-ml-model-ex02/preproc/eval.csv*

However, if the training is done in the cloud (through the gcloud command below), only checkpoints are produced. I did not expect this. After checking the documentation, it is clear that the chief node does not run evaluation.

export BUCKET=$(gcloud info --format='value(config.project)')
export JOB_ID=appbabyweight_`date +%Y%m%d_%H%M%S`
gcloud ai-platform jobs submit training  $JOB_ID \
  --region=us-east1 \
  --module-name=babyweight.task \
  --package-path=/home/jupyter/end-to-end-ml/examples/e2e-ml-model-ex02/app/babyweight \
  --scale-tier=BASIC \
  --runtime-version=2.4 \
  --python-version=3.7 \
  --staging-bucket=gs://$BUCKET \
  -- \
  --nnsize 64 32 \
  --output-dir=gs://$BUCKET/$JOB_ID/output-dir \
  --train-data-path=gs://$BUCKET/babyweight/preproc/train.csv* \
  --eval-data-path=gs://$BUCKET/babyweight/preproc/eval.csv* \
  --train-records-count=10000 \
  --eval-steps=1000

To work around this behavior, I set --use-chief-in-tf-config=false, as can be seen below. Unfortunately, the training job still produces no eval folder and no export folder.

export BUCKET=$(gcloud info --format='value(config.project)')
export JOB_ID=appbabyweight_`date +%Y%m%d_%H%M%S`
gcloud ai-platform jobs submit training  $JOB_ID \
  --region=us-east1 \
  --module-name=babyweight.task \
  --package-path=/home/jupyter/end-to-end-ml/examples/e2e-ml-model-ex02/app/babyweight \
  --use-chief-in-tf-config=false \
  --scale-tier=BASIC \
  --runtime-version=2.4 \
  --python-version=3.7 \
  --staging-bucket=gs://$BUCKET \
  -- \
  --nnsize 64 32 \
  --output-dir=gs://$BUCKET/$JOB_ID/output-dir \
  --train-data-path=gs://$BUCKET/babyweight/preproc/train.csv* \
  --eval-data-path=gs://$BUCKET/babyweight/preproc/eval.csv* \
  --train-records-count=10000 \
  --eval-steps=1000

Any suggestions?

Upvotes: 0

Views: 80

Answers (1)

marian.vladoi

Reputation: 8056

I believe you should check this related SO question:

ai-platform: No eval folder or export folder in outputs when running TensorFlow 2.1 training job using Estimators

It explains that the chief does not act as an evaluator by default.

To get evaluation output, you have to update your config YAML file to explicitly allocate an evaluator.

https://cloud.google.com/ai-platform/training/docs/distributed-training-details#tf-config-format
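As a rough sketch of what that could look like: the snippet below writes a config.yaml that requests one evaluator next to the master and then shows, as a comment, how the job from the question might be submitted with it. The file name config.yaml and the n1-standard-4 machine types are my assumptions, not something taken from the question, so adjust them to your workload. Because the config sets scaleTier to CUSTOM, the --scale-tier=BASIC flag is dropped.

```shell
# Sketch only: the file name and the machine types are assumptions.
# scaleTier must be CUSTOM so that an evaluator can be allocated explicitly.
cat > config.yaml <<'EOF'
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-4
  evaluatorType: n1-standard-4
  evaluatorCount: 1
EOF

# Submit the job with --config instead of --scale-tier=BASIC
# (left commented out here; BUCKET and JOB_ID are set as in the question):
#
# gcloud ai-platform jobs submit training $JOB_ID \
#   --region=us-east1 \
#   --module-name=babyweight.task \
#   --package-path=/home/jupyter/end-to-end-ml/examples/e2e-ml-model-ex02/app/babyweight \
#   --config=config.yaml \
#   --runtime-version=2.4 \
#   --python-version=3.7 \
#   --staging-bucket=gs://$BUCKET \
#   -- \
#   --nnsize 64 32 \
#   --output-dir=gs://$BUCKET/$JOB_ID/output-dir \
#   --train-data-path=gs://$BUCKET/babyweight/preproc/train.csv* \
#   --eval-data-path=gs://$BUCKET/babyweight/preproc/eval.csv* \
#   --train-records-count=10000 \
#   --eval-steps=1000
```

With an evaluator in the cluster, the evaluator task should be the one that reads the checkpoints and writes the eval and export folders, assuming the trainer uses tf.estimator.train_and_evaluate with an exporter as in the local run.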

Upvotes: 1
