Reputation: 3949
I have a TensorFlow DNNRegressor that I train both locally and in the cloud on GCP (AI Platform).
When training runs locally (via the gcloud
command below), the output directory contains checkpoints, an eval folder, and an export folder.
gcloud ai-platform local train \
--module-name=babyweight.task \
--package-path=/home/jupyter/end-to-end-ml/examples/e2e-ml-model-ex02/app/babyweight \
-- \
--nnsize 64 32 \
--train-data-path=/home/jupyter/end-to-end-ml/examples/e2e-ml-model-ex02/preproc/train.csv* \
--eval-data-path=/home/jupyter/end-to-end-ml/examples/e2e-ml-model-ex02/preproc/eval.csv*
But when training runs in the cloud (via the gcloud
command below), only checkpoints are produced. I did not expect this. After checking the documentation, it became clear that the chief node
does not run evaluation.
export BUCKET=$(gcloud info --format='value(config.project)')
export JOB_ID=appbabyweight_`date +%Y%m%d_%H%M%S`
gcloud ai-platform jobs submit training $JOB_ID \
--region=us-east1 \
--module-name=babyweight.task \
--package-path=/home/jupyter/end-to-end-ml/examples/e2e-ml-model-ex02/app/babyweight \
--scale-tier=BASIC \
--runtime-version=2.4 \
--python-version=3.7 \
--staging-bucket=gs://$BUCKET \
-- \
--nnsize 64 32 \
--output-dir=gs://$BUCKET/$JOB_ID/output-dir \
--train-data-path=gs://$BUCKET/babyweight/preproc/train.csv* \
--eval-data-path=gs://$BUCKET/babyweight/preproc/eval.csv* \
--train-records-count=10000 \
--eval-steps=1000
To work around this behavior, I set --use-chief-in-tf-config=false,
as shown below. Unfortunately, no eval folder and no export folder are produced after the training job finishes.
export BUCKET=$(gcloud info --format='value(config.project)')
export JOB_ID=appbabyweight_`date +%Y%m%d_%H%M%S`
gcloud ai-platform jobs submit training $JOB_ID \
--region=us-east1 \
--module-name=babyweight.task \
--package-path=/home/jupyter/end-to-end-ml/examples/e2e-ml-model-ex02/app/babyweight \
--use-chief-in-tf-config=false \
--scale-tier=BASIC \
--runtime-version=2.4 \
--python-version=3.7 \
--staging-bucket=gs://$BUCKET \
-- \
--nnsize 64 32 \
--output-dir=gs://$BUCKET/$JOB_ID/output-dir \
--train-data-path=gs://$BUCKET/babyweight/preproc/train.csv* \
--eval-data-path=gs://$BUCKET/babyweight/preproc/eval.csv* \
--train-records-count=10000 \
--eval-steps=1000
Any suggestion?
Upvotes: 0
Views: 80
Reputation: 8056
I believe you should check this related SO question:
As explained there, the chief by default does not act as an evaluator.
To get evaluation output, you have to update your config yaml to explicitly allocate an evaluator.
https://cloud.google.com/ai-platform/training/docs/distributed-training-details#tf-config-format
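A minimal sketch of such a config, assuming the `evaluatorType` and `evaluatorCount` fields of the AI Platform `TrainingInput` spec (the machine types below are placeholders, not a recommendation):

```shell
# Sketch only: allocate a dedicated evaluator node via a custom scale tier.
# Field names follow the AI Platform TrainingInput spec; machine types are
# placeholders you may need to adjust for your workload.
cat > config.yaml <<'EOF'
trainingInput:
  scaleTier: CUSTOM
  masterType: n1-standard-4
  evaluatorType: n1-standard-4
  evaluatorCount: 1
EOF
```

You would then pass this file with `--config=config.yaml` on the `gcloud ai-platform jobs submit training` command, and drop `--scale-tier=BASIC`, since the tier is now set in the file.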
Upvotes: 1