smallbirds

Reputation: 1067

How to avoid Azure DevOps Pipelines being abandoned

I have a pipeline on ADO (Azure DevOps). It's quite simple, and there are several steps: 1, 2, 3, 4, etc.

In one of the steps, some code is pushed to SageMaker (an AWS cloud service) and runs there for several hours. I want the ADO pipeline to wait for that to finish before moving on to the next step in the pipeline.

It's basically a Python script, invoked like python deploy_to_sagemaker.py algorithm.

However, the job is abandoned after around 40 minutes, probably because of CPU inactivity. Is there any way, in my .yml file or elsewhere, to tell the pipeline to wait for several hours no matter how little activity there is?

The error message is something like "We stopped hearing from agent id-xxx. Verify the agent machine is running and has a healthy network connection".

Upvotes: 1

Views: 4148

Answers (1)

Shayki Abramczyk

Reputation: 41655

You need to increase the job timeout. From the docs:

Timeouts

To avoid taking up resources when your job is unresponsive or waiting too long, it's a good idea to set a limit on how long your job is allowed to run. Use the job timeout setting to specify the limit in minutes for running the job. Setting the value to zero means that the job can run:

  • Forever on self-hosted agents
  • For 360 minutes (6 hours) on Microsoft-hosted agents with a public project and public repository
  • For 60 minutes on Microsoft-hosted agents with a private project or private repository (unless additional capacity is paid for)

The timeout period begins when the job starts running. It does not include the time the job is queued or is waiting for an agent.

The timeoutInMinutes allows a limit to be set for the job execution time. When not specified, the default is 60 minutes. When 0 is specified, the maximum limit is used (described above).

The cancelTimeoutInMinutes allows a limit to be set for the job cancel time when the deployment task is set to keep running if a previous task has failed. When not specified, the default is 5 minutes. The value should be in range from 1 to 35790 minutes.

jobs:
- job: Test
  timeoutInMinutes: 10 # how long to run the job before automatically cancelling
  cancelTimeoutInMinutes: 2 # how much time to give 'run always even if cancelled tasks' before stopping them
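
Applied to the scenario in the question, a minimal sketch might look like the following. The job name, display name, and pool name are hypothetical, and the pool is assumed to be self-hosted; on a Microsoft-hosted agent the limits quoted above still cap the effective timeout.

```yaml
jobs:
- job: DeployToSageMaker          # hypothetical job name
  pool:
    name: Default                 # assumption: a self-hosted agent pool
  timeoutInMinutes: 360           # allow the job to run for up to 6 hours
  cancelTimeoutInMinutes: 5       # time given to "run always" tasks after cancellation
  steps:
  - script: python deploy_to_sagemaker.py algorithm
    displayName: Deploy to SageMaker and wait   # hypothetical display name
```

The value 360 is only an illustration; pick a limit slightly above the longest SageMaker run you expect.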

You can also set the timeout for each task individually - see task control options.
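
As a rough sketch of the per-task variant, a step can carry its own timeoutInMinutes. The display name and the 300-minute value are assumptions for illustration, and the enclosing job timeout still applies, so it needs to be at least as long.

```yaml
steps:
- script: python deploy_to_sagemaker.py algorithm
  displayName: Deploy to SageMaker   # hypothetical display name
  timeoutInMinutes: 300              # limit for this step only
```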

Upvotes: 1
