What's the correct way to configure ansible tasks to make helm deployments fault tolerant of internet connection issues?

Question

I'm deploying helm charts using community.kubernetes.helm with ease but I've run into conditions where the connection is refused and it's not clear how best to configure a retries/wait/until. I've run into a case where every now and then, helm can't communicate with the cluster, here's an example (dns/ip faked) showing that the issue is as simple as not being able to connect to the cluster:

fatal: [localhost]: FAILED! => {"changed": false, "command": "/usr/local/bin/helm --kubeconfig /var/opt/kubeconfig --namespace=gpu-operator list --output=yaml --filter gpu-operator", "msg": "Failure when executing Helm command. Exited 1. stdout: stderr: Error: Kubernetes cluster unreachable: Get "https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial tcp 192.168.1.1:443: connect: connection refused ", "stderr": "Error: Kubernetes cluster unreachable: Get "https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial tcp 192.168.1.1:443: connect: connection refused ", "stderr_lines": ["Error: Kubernetes cluster unreachable: Get "https://ec2-host/k8s/clusters/c-xrsqn/version?timeout=32s": dial tcp 192.168.1.1:443: connect: connection refused"], "stdout": "", "stdout_lines": []}

In my experience, I have seen that try/retry will work. I agree that it would be ideal to figure out why I can't connect to the service, but it would be even more ideal to work around this by taking advantage of a catch all "until" block that tries this block until it works or gives up after N tries while taking a break of N seconds.

Here's an example of the ansible block:

- name: deploy Nvidia GPU Operator
  block:
    - name: deploy gpu operator
      community.kubernetes.helm:
        name: gpu-operator
        chart_ref: "{{ CHARTS_DIR }}/gpu-operator"
        create_namespace: yes
        release_namespace: gpu-operator
        kubeconfig: "{{ STATE_DIR }}/{{ INSTANCE_NAME }}-kubeconfig"
      until: ??? 
      retries: 5
      delay: 3
  when: GPU_NODE is defined

I would really appreciate any suggestions/pointers.

What's the correct way to configure ansible tasks to make helm deployments fault tolerant of internet connection issues?

Answers (1)

Related Questions

What&#39;s the correct way to configure ansible tasks to make helm deployments fault tolerant of internet connection issues?

Answers (1)

Related Questions

What's the correct way to configure ansible tasks to make helm deployments fault tolerant of internet connection issues?