We run custom training jobs in Vertex AI, scheduled once a week with Airflow. All of the jobs are submitted to Vertex AI at the same time, but they run sequentially, one at a time: each job takes around 10 minutes to run while the other 20+ jobs sit pending.
Since we submit the custom jobs together, we expected them to run at least in batches (for example, 5 at a time), but they are started one after another. This is the Vertex AI config that we are using:
{
    "displayName": display_name,
    "trainingTaskDefinition": PREDICTION_JOB_SCHEMA_URI,
    "trainingTaskInputs": {
        "serviceAccount": VERTEX_SERVICE_ACCOUNT,
        "workerPoolSpecs": [
            {
                "machineSpec": {
                    "machineType": "n2-standard-16",
                },
                "replicaCount": 1,
                "pythonPackageSpec": {
                    "executorImageUri": PREDICTION_EXECUTOR_IMAGE_URI,
                    "packageUris": task_params["package_uris"],
                    "pythonModule": task_params["python_module"],
                    "args": task_params["args"],
                    "env": task_params["envs"],
                },
            }
        ],
    },
}
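
To make the expected behaviour concrete, here is a minimal local sketch of "5 at a time" scheduling. It is only an illustration, not the actual Airflow or Vertex AI code: `run_job`, `run_all`, `peak_concurrency` and `MAX_PARALLEL` are hypothetical names, and a short `time.sleep` stands in for the roughly 10-minute job.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 5  # hypothetical batch size, taken from the "5 at a time" example


def run_job(display_name: str) -> tuple[str, float, float]:
    """Stand-in for one custom job; the real job runs for about 10 minutes."""
    start = time.monotonic()
    time.sleep(0.05)  # placeholder for the job's runtime
    return display_name, start, time.monotonic()


def run_all(display_names: list[str], max_parallel: int = MAX_PARALLEL):
    """Run every job with at most `max_parallel` of them in flight at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map returns results in the same order as the input names.
        return list(pool.map(run_job, display_names))


def peak_concurrency(results) -> int:
    """Largest number of jobs whose run intervals overlap."""
    events = []
    for _, start, end in results:
        events.append((start, 1))   # job started
        events.append((end, -1))    # job finished
    running = peak = 0
    # At equal timestamps, finishes (-1) sort before starts (+1).
    for _, delta in sorted(events):
        running += delta
        peak = max(peak, running)
    return peak
```

Calling `peak_concurrency(run_all(names))` on 20+ names never reports more than 5 overlapping jobs, while `max_parallel=1` reproduces the strictly sequential pattern described in the question.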