Tobias Hermann

Reputation: 10956

How to avoid parallel requests to a pod belonging to a K8s service?

I have an (internal) K8s deployment (Python, TensorFlow, Gunicorn) with around 60 replicas and an attached K8s service to distribute incoming HTTP requests. Each of these 60 pods can only actually process one HTTP request at a time (because of TensorFlow reasons). Processing a request takes between 1 and 4 seconds. If a second request is sent to a pod while it's still processing one, the second request is just queued in the Gunicorn backlog.

Now I'd like to reduce the probability of that queuing happening as much as possible, i.e., have new requests routed to one of the non-occupied pods as long as such a non-occupied one exists.

Round-robin would not do the trick, because not every request takes the same amount of time to answer (see above).

The Python application itself could make the endpoint used for the ReadinessProbe fail while it's processing a normal request, but as far as I understand, readiness probes are not meant for something that dynamic (K8s would need to poll them multiple times per second).
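
For illustration, that idea would look roughly like this (a minimal sketch assuming a Flask app behind Gunicorn; the /ready endpoint, busy flag, and model call are placeholders, not my actual code):

    import threading
    import time

    from flask import Flask, jsonify

    app = Flask(__name__)
    busy = threading.Event()  # set while an inference request is being processed


    def run_tensorflow_model():
        # Placeholder for the actual 1-4 second TensorFlow inference.
        time.sleep(2)
        return {"prediction": 42}


    @app.route("/ready")
    def ready():
        # Readiness probe endpoint: report "not ready" while a request is in flight.
        if busy.is_set():
            return "busy", 503
        return "ok", 200


    @app.route("/predict", methods=["POST"])
    def predict():
        busy.set()
        try:
            return jsonify(run_tensorflow_model())
        finally:
            busy.clear()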

So how could I achieve the goal?

Upvotes: 2

Views: 1987

Answers (2)

gohm'c

Reputation: 15548

...kubectl get service shows "TYPE ClusterIP" and "EXTERNAL-IP <none>"

Your k8s service will be routing requests at random in this case... obviously not good for your app. If you would like to stick with kube-proxy, you can switch to IPVS mode with the "sed" (shortest expected delay) scheduler. Here's a good article about it. Otherwise, you can consider using an ingress controller like the one mentioned earlier, e.g. ingress-nginx with its "ewma" load-balancing mode.
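
A rough sketch of both options (field and resource names follow the upstream APIs, but everything prefixed with "model-" is hypothetical, so verify against your cluster version and distro):

    # kube-proxy: enable IPVS with the "sed" (shortest expected delay) scheduler.
    # This usually lives in the kube-proxy ConfigMap in kube-system.
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    mode: "ipvs"
    ipvs:
      scheduler: "sed"
    ---
    # Alternative: per-Ingress EWMA load balancing with ingress-nginx.
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: model-ingress                 # hypothetical name
      annotations:
        nginx.ingress.kubernetes.io/load-balance: "ewma"
    spec:
      defaultBackend:
        service:
          name: model-service             # hypothetical backend service
          port:
            number: 80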

Upvotes: 1

Harsh Manvar

Reputation: 30180

Could you implement pub/sub or a message broker in between?

Save the incoming request data into a queue; based on its capacity, each worker will fetch a message from the queue and the request will get processed.

You can use Redis to create the queues, and pub/sub on top of it is also possible using a library. I used one before in Node.js, but it should be possible to implement the same in Python as well.

Ideally, a worker (or we could say a subscriber) will be running in each of the 60 replicas.

As soon as you get a request, one application will publish it, and the subscribers will continuously work on processing those messages.

We also went one step further and scaled the worker count automatically depending on the message count in the queue.

This is the library I am using with Node.js: https://github.com/OptimalBits/bull
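
The same pattern in Python could look roughly like this (a minimal sketch with hypothetical queue and function names, assuming redis-py and a reachable Redis instance):

    import json

    import redis

    r = redis.Redis(host="localhost", port=6379)  # assumed Redis endpoint


    # Producer side: the receiving app publishes each request onto a queue
    # instead of calling a pod directly.
    def enqueue_request(request_id, payload):
        r.rpush("inference_queue", json.dumps({"id": request_id, "payload": payload}))


    def run_model(payload):
        # Placeholder for the actual 1-4 second TensorFlow inference.
        return {"ok": True, "echo": payload}


    # Worker/subscriber side: each of the 60 replicas runs this loop and pulls
    # exactly one message at a time, so a busy pod never receives a second request.
    def worker_loop():
        while True:
            _queue, raw = r.blpop("inference_queue")  # blocks until a message arrives
            msg = json.loads(raw)
            result = run_model(msg["payload"])
            r.rpush(f"results:{msg['id']}", json.dumps(result))

With this, the K8s service no longer decides which pod handles a request; an idle worker pulls the next message as soon as it is free, which gives you the "only route to non-occupied pods" behaviour.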

Upvotes: 2
