I deploy machine learning models for the web using Triton Inference Server on AWS SageMaker. I use asynchronous endpoints because they accept larger payloads (up to 1 GB) than real-time endpoints (6 MB). I also apply an auto-scaling policy to the endpoint so that it scales up when traffic increases and scales down to zero when no one is using it.
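For context, the scaling range described above could be registered roughly as in the sketch below. It uses the Application Auto Scaling `register_scalable_target` call from boto3; the endpoint name `my-triton-async-endpoint`, the variant name `AllTraffic`, and the maximum of 2 instances are placeholders for illustration, not my real values.

```python
def scalable_target_params(endpoint_name, variant_name, min_capacity, max_capacity):
    """Build the arguments for registering a SageMaker endpoint variant
    as a scalable target with Application Auto Scaling."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,  # 0 allows scale-to-zero; 1 keeps one instance warm
        "MaxCapacity": max_capacity,
    }


def register(params):
    """Send the registration to AWS (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("application-autoscaling")
    return client.register_scalable_target(**params)


# Hypothetical names and limits, for illustration only.
scale_to_zero = scalable_target_params("my-triton-async-endpoint", "AllTraffic", 0, 2)
always_warm = scalable_target_params("my-triton-async-endpoint", "AllTraffic", 1, 2)
```

The only difference between the two configurations is `MinCapacity`: 0 gives the current scale-to-zero behaviour, while 1 would keep one instance running at all times.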
In this regard, I have two questions. As a beginner in MLOps and AWS, I’d appreciate any insights from those with experience:
When no one is using the endpoint, it does scale down to zero instances, but I’m still billed for the time the endpoint is in service with zero instances running (I monitored it, and it cost around $0.50 per 6 hours). When someone then invokes the endpoint, it takes about 10 minutes (approximately 10 times) for one instance to start running (I know this is called a cold start in serverless computing). My question is: since SageMaker charges for the time an instance is running (which is the major part of the bill), what would I pay if I kept the endpoint running with a minimum of one instance?
The instance being used is: ml.g4dn.xlarge.
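To make the comparison concrete, here is a rough back-of-the-envelope sketch of the arithmetic I have in mind. The hourly rate of $0.736 for ml.g4dn.xlarge is an assumption (prices vary by region and change over time, so it should be checked against the SageMaker pricing page), and the 30-day month is a simplification. The idle figure is the $0.50 per 6 hours I observed.

```python
HOURS_PER_DAY = 24

# ASSUMPTION: on-demand hosting rate for ml.g4dn.xlarge in USD per hour.
# Verify against the SageMaker pricing page for your region.
ASSUMED_HOURLY_RATE_USD = 0.736

# Observed in my account: about $0.50 per 6 hours while scaled to zero.
OBSERVED_IDLE_COST_USD = 0.50
OBSERVED_IDLE_WINDOW_HOURS = 6


def always_on_cost(hourly_rate_usd, instances=1, days=30):
    """Cost of keeping `instances` running continuously for `days` days."""
    return hourly_rate_usd * instances * HOURS_PER_DAY * days


def idle_cost(cost_usd, window_hours, days=30):
    """Extrapolate an observed idle charge to `days` days."""
    return cost_usd / window_hours * HOURS_PER_DAY * days


monthly_always_on = always_on_cost(ASSUMED_HOURLY_RATE_USD)
monthly_idle = idle_cost(OBSERVED_IDLE_COST_USD, OBSERVED_IDLE_WINDOW_HOURS)

print(f"One instance always on: ${monthly_always_on:.2f} per 30 days")
print(f"Scaled to zero (observed idle charge): ${monthly_idle:.2f} per 30 days")
```

Under these assumptions, keeping one instance warm would cost roughly $530 per 30 days, compared with roughly $60 for the idle charge I am seeing now, so the trade-off is that difference against the 10-minute cold start.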