Reputation: 2807
I have 2-3 machine learning models I am trying to host via Kubernetes. I don't get much usage on the models right now, but they are critical and need to be available when called upon.
I am providing access to the models via a flask app and am using a load balancer to route traffic to the flask app.
Everything typically works fine since requests are only made intermittently, but I've come to find that if multiple requests are made at the same time, my pod crashes due to OOM. Isn't this the job of the load balancer? To make sure requests are routed appropriately? (In this case, to route the next request only after the previous ones have completed?)
Below is my deployment:
apiVersion: v1
kind: Service
metadata:
  name: flask-service
  labels:
    run: flask-service
spec:
  selector:
    app: flask
  ports:
    - protocol: "TCP"
      port: 5000
      targetPort: 5000
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask
spec:
  selector:
    matchLabels:
      app: flask
  replicas: 1
  template:
    metadata:
      labels:
        app: flask
    spec:
      containers:
        - name: flask
          imagePullPolicy: Always
          image: gcr.io/XXX/flask:latest
          ports:
            - containerPort: 5000
          resources:
            limits:
              memory: 7000Mi
            requests:
              memory: 1000Mi
Upvotes: 0
Views: 850
Reputation: 129065
Isn't this the job of the load balancer? To make sure requests are routed appropriately?
Yes, you are right. But...
replicas: 1
You only use a single replica, so the load balancer has no other instances of your application to route traffic to. Give it multiple instances, as in the sketch below.
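A minimal sketch of that change, showing only the field that changes in the Deployment spec from the question (3 is just an example count):

spec:
  replicas: 3   # several pods instead of one; the Service spreads traffic across them

Keep in mind that each replica reserves its own memory request, so your node(s) need room for all of them.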
I've come to find that if multiple requests are made at the same time my pod crashes due to OOM
It sounds like your application has very limited resources.
resources:
  limits:
    memory: 7000Mi
  requests:
    memory: 1000Mi
When your application uses more than 7000Mi, it will be OOM-killed (also consider increasing the request value). If your app needs more, you can give it more memory (scale vertically) or add more instances (scale horizontally).
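For vertical scaling, a sketch of the resources block with placeholder numbers (size them from what a single model invocation actually uses, e.g. via kubectl top pod or your monitoring):

resources:
  requests:
    memory: 4000Mi    # placeholder: typical steady-state usage plus some margin
  limits:
    memory: 12000Mi   # placeholder: headroom for a few concurrent requests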
Everything typically works fine since requests are only made intermittently
Consider using the Horizontal Pod Autoscaler; it can scale your application up to more instances when you have more requests and scale it down when there are fewer. This can be based on memory or CPU usage, for example.
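A sketch of such an autoscaler targeting the Deployment above (the name flask-hpa, the replica bounds, and the 70% target are example values; resource-based scaling also requires the metrics-server add-on in the cluster):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flask-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flask               # the Deployment from the question
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70   # scale out when average memory use exceeds 70% of the request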
route the next request after the previous ones are complete?
If this is the behavior you want, then you need to use a queue, e.g. RabbitMQ or Kafka, to process your requests one at a time.
Upvotes: 1