kg_sYy

Reputation: 1215

How to observe and control how SageMaker Multi Model Server loads models into memory

I am evaluating SageMaker Multi Model Server (MMS) as an option for hosting a large number of models for inference. I have successfully built the container according to the SageMaker BYOC MMS instructions. I can invoke inference, and the models work fine on SageMaker.

I run my tests on the smallest instance type available, ml.t2.medium. MMS is described as downloading models from S3 to the container, loading them into memory as needed, and then unloading them from memory when memory runs low.

In my experiment, the CloudWatch metric LoadedModelCount stays at around 8-10, even if I run inference on a much larger set of models. If I keep the number of models invoked small, an inference call takes about 0.1 seconds. If I invoke more models than the LoadedModelCount, the inference time goes up to about 2 seconds.

So my guess is that the SageMaker MMS is unloading models from memory, and loading new models into memory, basically memory-swapping constantly. I put logging into my MMS model handler to show that it keeps initializing the handler for different models over and over when this happens.
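For reference, a minimal sketch of the kind of handler logging described above. It assumes the usual MMS custom-service convention of a module-level `handle(data, context)` entry point; the class name `LoggingModelHandler`, the logger name, and the placeholder "prediction" are hypothetical and not taken from my actual handler.

```python
import logging
import time

logger = logging.getLogger("mme_handler")  # hypothetical logger name


class LoggingModelHandler:
    """Toy handler that logs every (re)initialization of a model."""

    def __init__(self):
        self.initialized = False
        self.model_name = None
        self.load_count = 0

    def initialize(self, context):
        # MMS passes a context object; read it defensively in case
        # an attribute is missing in a given server version.
        props = getattr(context, "system_properties", None) or {}
        self.model_name = getattr(context, "model_name", "unknown")
        self.load_count += 1
        logger.info(
            "LOAD model=%s dir=%s at=%.3f",
            self.model_name, props.get("model_dir"), time.time(),
        )
        # Real code would load the model artifacts from model_dir here.
        self.initialized = True

    def handle(self, data, context):
        if not self.initialized:
            self.initialize(context)
        # Placeholder output: one entry per request in the batch.
        return [{"model": self.model_name} for _ in (data or [])]


_service = LoggingModelHandler()


def handle(data, context):
    """Module-level entry point that the server calls for each batch."""
    return _service.handle(data, context)
```

Seeing the `LOAD` line repeat for a model that was already served earlier suggests the model was unloaded and loaded again in between.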

Also, the CloudWatch metric DiskUtilization keeps going up as more models are invoked, which I take to mean that the models are downloaded from S3 onto the container disk. The other metrics (memory and loaded models), on the other hand, plateau after 8-10 loaded models, with only minor changes up and down. This further supports the theory that models are constantly swapped between container disk and memory.

I cannot find a way to see when MMS actually unloads a model from memory, or when it loads a different one. I also cannot see what threshold it uses to decide when to unload models, as the CloudWatch MemoryUtilization metric from the SageMaker instance never goes above 45, which I guess means that at most 45% of memory is used. This seems like a very low threshold, so I would expect there to be a way to configure it, but I have not found one.

Question 1: How can I observe when MMS is unloading models from memory, and loading new ones?

Question 2: How can I control the memory thresholds (or whatever MMS uses) that define when to unload the models?

Upvotes: 0

Views: 931

Answers (1)

Alaroff

Reputation: 2298

SageMaker unloads the least recently used model from memory when memory is full. The model artifacts stay on disk, and are deleted from disk only when the disk cache is running out of space as well.

Unless the most recently used model is memory hungry and takes up all the memory of the instance, you should not get OOM exceptions.

As stated in the documentation:

Amazon SageMaker unloads unused models from the container when the instance is reaching memory capacity and more models need to be downloaded into the container. Amazon SageMaker also deletes unused model artifacts from the instance storage volume when the volume is reaching capacity and new models need to be downloaded. The first invocation to a newly added model takes longer because the endpoint takes time to download the model from S3 to the container's memory in instance hosting the endpoint
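The latency jump in the question is what least-recently-used eviction would produce once the set of invoked models no longer fits in memory. Below is a toy simulation of that effect. It is only an illustration, not SageMaker's actual implementation: real eviction is driven by memory pressure, while the fixed `capacity` here is a stand-in for "how many models fit", and the function name `simulate_lru` is made up.

```python
from collections import OrderedDict


def simulate_lru(requests, capacity):
    """Count cache hits, model loads and evictions for a request sequence."""
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = loads = evictions = 0
    for name in requests:
        if name in cache:
            cache.move_to_end(name)  # mark as most recently used
            hits += 1
            continue
        if len(cache) >= capacity:
            cache.popitem(last=False)  # evict the least recently used model
            evictions += 1
        cache[name] = True
        loads += 1
    return {"hits": hits, "loads": loads, "evictions": evictions}


# Cycle through 8 models, then through 12, with room for 8 in memory.
fits = [f"model-{i % 8}" for i in range(40)]
too_many = [f"model-{i % 12}" for i in range(60)]

print(simulate_lru(fits, capacity=8))      # {'hits': 32, 'loads': 8, 'evictions': 0}
print(simulate_lru(too_many, capacity=8))  # {'hits': 0, 'loads': 60, 'evictions': 52}
```

With round-robin traffic over more models than fit, every request misses, so each one pays the load cost. That matches the handler being initialized over and over.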

When a model is evicted from memory, the UnloadModel API on the inference container will be called. There's no indication on the InvokeEndpoint response itself that a model was evicted from memory during that request, but there is a ModelUnloadingTime CloudWatch metric that shows the time taken to unload a model during a request.
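To watch this from outside the container, you can pull the metric with boto3. The sketch below assumes the multi-model loading metrics are published in the `AWS/SageMaker` namespace with `EndpointName` and `VariantName` dimensions, which is how I recall the documentation describing them, so verify this against your own CloudWatch console. The helper names and the endpoint and variant names in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_metric_query(endpoint_name, variant_name, metric_name,
                       minutes=60, period=60):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SageMaker",  # assumed namespace, see note above
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period,  # seconds per datapoint
        "Statistics": ["SampleCount", "Average", "Maximum"],
    }


def fetch_datapoints(query):
    """Run the query and return datapoints sorted by time (needs AWS credentials)."""
    import boto3  # imported lazily so the query builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(**query)
    return sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
```

For example, `fetch_datapoints(build_metric_query("my-endpoint", "AllTraffic", "ModelUnloadingTime"))` should return one datapoint per minute in which an unload was recorded, and its `SampleCount` roughly indicates how many unloads happened in that minute. Passing `"LoadedModelCount"` instead lets you line the two up.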

Upvotes: 1
