Limit FastAPI/gunicorn/... worker to certain endpoints to save memory

Question

I have a FastAPI application with multiple endpoints, and each endpoint uses certain memory-intensive objects (ML models). This works fine with a single worker, but I am worried about memory usage (and, to a lesser extent, startup time) when I scale to multiple workers.

Is there a way to limit certain workers to certain endpoints only? Then I would only load the objects required for the respective endpoint.
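To make the idea concrete, here is a rough sketch of the loading side only. It assumes each worker is somehow told which endpoints it serves; the WORKER_ENDPOINTS environment variable, the endpoint paths, and the loader functions are all made up for illustration. How requests would actually reach the right worker is the open part of my question and is not shown.

```python
import os

# Hypothetical loaders standing in for the expensive ML models
# (roughly 2 GB each in my case; here they just return small dicts).
def load_model_a():
    return {"name": "model_a"}

def load_model_b():
    return {"name": "model_b"}

# Which loader each endpoint needs. The paths are made up for illustration.
ENDPOINT_LOADERS = {
    "/endpoint_a": load_model_a,
    "/endpoint_b": load_model_b,
}

def assigned_endpoints_from_env():
    """Read a comma-separated list from the hypothetical WORKER_ENDPOINTS variable.

    If the variable is unset or empty, fall back to serving every endpoint.
    """
    raw = os.environ.get("WORKER_ENDPOINTS", "")
    names = [part.strip() for part in raw.split(",") if part.strip()]
    return names or list(ENDPOINT_LOADERS)

def load_objects_for_worker(assigned_endpoints):
    """Load only the objects needed by the endpoints assigned to this worker."""
    return {path: ENDPOINT_LOADERS[path]() for path in assigned_endpoints}
```

A worker started with WORKER_ENDPOINTS=/endpoint_a would then call load_objects_for_worker(assigned_endpoints_from_env()) at startup and hold only the first model in memory.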

Specifically, assume I have two endpoints whose models use 2 GB each. If I scale to four workers, every worker loads both models, so I need 2 GB x 2 x 4 = 16 GB.

If the first two workers only serve the first endpoint and the other two workers only serve the second endpoint, every process needs to load just one of the models, so I would need 2 GB x 1 x 4 = 8 GB. This of course assumes that the load is approximately equal across the two endpoints, which is the case here.
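The arithmetic for the two layouts, written out as a small sketch (the function name is just for illustration, and the numbers are the ones from my example, not measurements):

```python
def total_memory_gb(model_gb, models_per_worker, workers):
    """Total memory if every worker holds its own copy of each model it loads."""
    return model_gb * models_per_worker * workers

# Every worker serves both endpoints, so it loads both 2 GB models.
all_models = total_memory_gb(model_gb=2, models_per_worker=2, workers=4)

# Each worker is limited to one endpoint, so it loads a single 2 GB model.
partitioned = total_memory_gb(model_gb=2, models_per_worker=1, workers=4)

print(all_models, partitioned)
```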

Alternatives:

One option would be a microservice architecture, where each endpoint is its own application. However, this question only came up because I am trying to move away from microservices, since I had problems with the reliability of such an architecture (e.g. needing some kind of scheduler, having to forward the HTTP endpoints, and high latency due to multiple layers of forwarding). On top of that, some of the endpoints are little more than return calculation(huge_object[param]).
Sharing the objects among workers does not seem to be technically possible in the general case.
