Tobias Hermann

Reputation: 10956

How to avoid parallel requests to a pod belonging to a K8s service?

I have an (internal) K8s deployment (Python, TensorFlow, Gunicorn) with around 60 replicas and an attached K8s service to distribute incoming HTTP requests. Each of these 60 pods can only actually process one HTTP request at a time (because of TensorFlow reasons). Processing a request takes between 1 and 4 seconds. If a second request is sent to a pod while it's still processing one, the second request is just queued in the Gunicorn backlog.

Now I'd like to reduce the probability of that queuing happening as much as possible, i.e., have new requests routed to one of the non-occupied pods as long as such a non-occupied one exists.

Round-robin would not do the trick, because not every request takes the same amount of time to answer (see above).

The Python application itself could make the endpoint used for the ReadinessProbe fail while it's processing a normal request, but as far as I understand, readiness probes are not meant for something that dynamic (K8s would need to poll them multiple times per second).
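
For illustration, that idea would look roughly like this (a minimal sketch assuming a Flask app behind Gunicorn; the /ready endpoint, busy flag, and model call are placeholders, not my actual code):

    import threading
    import time

    from flask import Flask, jsonify

    app = Flask(__name__)
    busy = threading.Event()  # set while an inference request is being processed


    def run_tensorflow_model():
        # Placeholder for the actual 1-4 second TensorFlow inference.
        time.sleep(2)
        return {"prediction": 42}


    @app.route("/ready")
    def ready():
        # Readiness probe endpoint: report "not ready" while a request is in flight.
        if busy.is_set():
            return "busy", 503
        return "ok", 200


    @app.route("/predict", methods=["POST"])
    def predict():
        busy.set()
        try:
            return jsonify(run_tensorflow_model())
        finally:
            busy.clear()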

So how could I achieve the goal?

Upvotes: 2

Views: 1987

Answers (2)

gohm'c

Reputation: 15548

...kubectl get service shows "TYPE ClusterIP" and "EXTERNAL-IP <none>"

Your k8s service will be routing requests at random in this case... obviously not good for your app. If you would like to stick with kube-proxy, you can switch to IPVS mode with the "sed" (shortest expected delay) scheduler. Here's a good article about it. Otherwise, you can consider using an ingress controller like the one mentioned earlier, e.g. ingress-nginx with its "ewma" load-balancing mode.
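
A rough sketch of both options (field and resource names follow the upstream APIs, but everything prefixed with "model-" is hypothetical, so verify against your cluster version and distro):

    # kube-proxy: enable IPVS with the "sed" (shortest expected delay) scheduler.
    # This usually lives in the kube-proxy ConfigMap in kube-system.
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    mode: "ipvs"
    ipvs:
      scheduler: "sed"
    ---
    # Alternative: per-Ingress EWMA load balancing with ingress-nginx.
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: model-ingress                 # hypothetical name
      annotations:
        nginx.ingress.kubernetes.io/load-balance: "ewma"
    spec:
      defaultBackend:
        service:
          name: model-service             # hypothetical backend service
          port:
            number: 80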

Upvotes: 1

Harsh Manvar

Reputation: 30180

Could you implement pub/sub or a message broker in between?

Save the incoming request data into a queue; based on its capacity, each worker will fetch a message from the queue and the request will get processed.

You can use Redis to create the queues, and pub/sub on top of it is also possible using a library. I used one before in Node.js, but it should be possible to implement the same in Python as well.

Ideally, a worker (or we could say a subscriber) will be running in each of the 60 replicas.

As soon as you get a request, one application will publish it, and the subscribers will continuously work on processing those messages.

We also went one step further and scaled the worker count automatically depending on the message count in the queue.

This is the library I am using with Node.js: https://github.com/OptimalBits/bull
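
The same pattern in Python could look roughly like this (a minimal sketch with hypothetical queue and function names, assuming redis-py and a reachable Redis instance):

    import json

    import redis

    r = redis.Redis(host="localhost", port=6379)  # assumed Redis endpoint


    # Producer side: the receiving app publishes each request onto a queue
    # instead of calling a pod directly.
    def enqueue_request(request_id, payload):
        r.rpush("inference_queue", json.dumps({"id": request_id, "payload": payload}))


    def run_model(payload):
        # Placeholder for the actual 1-4 second TensorFlow inference.
        return {"ok": True, "echo": payload}


    # Worker/subscriber side: each of the 60 replicas runs this loop and pulls
    # exactly one message at a time, so a busy pod never receives a second request.
    def worker_loop():
        while True:
            _queue, raw = r.blpop("inference_queue")  # blocks until a message arrives
            msg = json.loads(raw)
            result = run_model(msg["payload"])
            r.rpush(f"results:{msg['id']}", json.dumps(result))

With this, the K8s service no longer decides which pod handles a request; an idle worker pulls the next message as soon as it is free, which gives you the "only route to non-occupied pods" behaviour.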

Upvotes: 2
