satya

Reputation: 11

Bitbucket self-hosted runners on EC2 exiting with exit code 137

We use a Bitbucket self-hosted runner for our pipelines. The runner is installed on an EC2 instance running Amazon Linux.

The pipelines ran properly for a month. After that, the pipeline containers started failing with Docker exit code 137.
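
For reference, an exit status of 137 follows the common shell convention of 128 plus a signal number, so it corresponds to signal 9 (SIGKILL). That says the container's main process was killed, but not who sent the signal: the kernel OOM killer, "docker kill", and a "docker stop" that times out can all produce it. A minimal shell sketch of the decoding follows; the function name describe_exit_status is made up for illustration.

```shell
#!/bin/sh
# Describe an exit status, assuming the usual shell convention that
# statuses above 128 mean "killed by signal (status - 128)".
describe_exit_status() {
    exit_status=$1
    if [ "$exit_status" -gt 128 ]; then
        signal=$((exit_status - 128))
        # kill -l <number> prints the signal name; fall back if it is unknown.
        name=$(kill -l "$signal" 2>/dev/null) || name=unknown
        echo "killed by signal $signal ($name)"
    else
        echo "exited normally with status $exit_status"
    fi
}

# Reproduce a 137 locally: the child shell sends SIGKILL to itself.
demo_status=0
sh -c 'kill -9 $$' || demo_status=$?
describe_exit_status "$demo_status"   # prints: killed by signal 9 (KILL)
```

The same decoding applies to the status Docker reports for a container, since that is the exit status of the container's main process.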

Note: We have scheduled yum updates to run every week on the runner machine.

[Image: Bitbucket pipeline failure]

These are the steps I took to troubleshoot:

  1. Checked the yum updates and system logs for anything abnormal
  2. Restarted the Docker service and rebooted the machine
  3. Monitored CPU and memory during pipeline execution
  4. Upgraded the Bitbucket runner to the latest version
  5. Increased the machine's CPU and memory configuration
  6. Checked storage usage, which is only 35%
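
One way to cross-check the memory monitoring in step 3 is to look for OOM-killer messages in the kernel log on the runner host. The sketch below is only an assumption about how that could be done; the function name find_oom_kills is hypothetical, and the pattern is a rough match for the messages the Linux kernel usually prints, not an exhaustive list.

```shell
#!/bin/sh
# Filter kernel log text on stdin down to lines suggesting the OOM killer acted.
# The match is case-insensitive and deliberately broad.
find_oom_kills() {
    grep -i -E 'out of memory|oom-kill|killed process'
}
```

On the runner host this could be fed from the kernel log, for example "dmesg -T | find_oom_kills" (reading the kernel log may require root). No output around the time of a failed pipeline would be consistent with the 137 not coming from a host-level OOM kill.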

None of this helped. The pipelines continued to fail with the same error.

Because the issue was unpredictable, on the advice of the AWS support team I moved all the runners to a machine using the ECS-optimized AMI, since that AMI is designed mainly for containers. The same thing happened on this machine too: after a month, the pipelines started failing with the same error. This time I investigated again and manually checked the yum updates for any package update that might be incompatible with Docker.

Note: The runner container itself is up and running properly, but the pipeline containers fail with 137, and this is not caused by an OOM or storage issue.

Later, we moved the runner again, to a different Amazon Linux machine. The same issue occurred for the third time after a month.

Nothing has helped. On further investigation, I found that removing the Docker service declaration (services: docker) from the pipeline makes the pipeline succeed. The following is an example pipeline configuration.

      - step:
          name: generate dbt elementary report
          image:
            name: edr-build:latest
          clone:
            enabled: true
          runs-on:
            - 'linux'
            - 'self.hosted'
            - 'test'
          services:
            - docker
          caches:
            - docker
          script:
            - docker --version
            - docker build -t <tag> .
            - docker push to <ecr>

I cannot remove the services: docker section because I need to run docker commands in the pipeline.

Also, the pipelines start failing all of a sudden. If the problem really were with the services: docker section or with the runner machine, I would expect the pipelines to fail with the 137 error from the start.

I cannot view any logs of the failing pipeline containers because they are removed immediately after the pipeline execution finishes. I cannot list them with "docker ps -a" either. All records of the pipeline containers are erased, which may be how the Bitbucket runner is designed to work.
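
Because the containers are removed before they can be inspected, one option is to record Docker's event stream on the runner host while a pipeline runs, so that the die, kill, and oom events survive the cleanup. This is a sketch under assumptions: the function names and the default log path are made up, and the grep pattern assumes the compact JSON that "docker events" prints with --format '{{json .}}', where exitCode appears as a string attribute of the die event.

```shell
#!/bin/sh
# Append container die/kill/oom events to a log file until interrupted.
# Repeating the "event" filter is treated by Docker as a logical OR.
watch_container_exits() {
    log_file=${1:-/tmp/docker-events.log}   # hypothetical default path
    docker events \
        --filter 'type=container' \
        --filter 'event=die' \
        --filter 'event=kill' \
        --filter 'event=oom' \
        --format '{{json .}}' >> "$log_file"
}

# Read recorded events on stdin and keep only those with exit code 137.
filter_sigkill_exits() {
    grep '"exitCode":"137"'
}
```

Start "watch_container_exits /tmp/docker-events.log" in a separate shell before triggering the pipeline, then run "filter_sigkill_exits < /tmp/docker-events.log" afterwards. A kill event recorded just before a die event should show which signal was sent to the container, and an oom event would point to a memory limit being hit.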

I am still not sure where to look or what to check, since this issue occurs on multiple machines.

Note: The Docker version is the same on all the machines.

Please help me with this.

Upvotes: 1

Views: 172

Answers (0)
