How to observe and control how SageMaker Multi Model Server loads models into memory

Question

I am evaluating SageMaker Multi Model Server (MMS) as an option for hosting a large number of models for inference. I have successfully built the container following the SageMaker BYOC MMS instructions. I can invoke inference, and the models work fine on SageMaker.

I run my tests on the smallest instance type available, ml.t2.medium. MMS is described as downloading models from S3 to the container, loading them into memory as needed, and unloading them from memory when memory runs low.

In my experiment, MMS consistently reports a CloudWatch LoadedModelCount metric of around 8-10, even when I run inference on a much larger set of models. If I keep the number of invoked models small, an inference call takes about 0.1 seconds. If I invoke more models than the LoadedModelCount, the inference time goes up to about 2 seconds.
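To make the latency observation reproducible, here is a minimal sketch of how the per-model timings could be collected. It is illustrative only: `time_invocations`, `split_by_latency`, and the 1.0 second cutoff are names and values assumed for this example (the cutoff simply sits between the roughly 0.1 s and 2 s timings described above), and `invoke` is any callable that sends one request for a given model.

```python
import time


def time_invocations(invoke, model_names):
    """Call invoke(name) once per model and return (name, seconds) pairs."""
    timings = []
    for name in model_names:
        start = time.perf_counter()
        invoke(name)
        timings.append((name, time.perf_counter() - start))
    return timings


def split_by_latency(timings, threshold_s=1.0):
    """Split model names into slow calls (probably a model load) and fast calls.

    threshold_s is an assumed cutoff, not a value used by MMS itself.
    """
    slow = [name for name, seconds in timings if seconds >= threshold_s]
    fast = [name for name, seconds in timings if seconds < threshold_s]
    return slow, fast
```

With boto3, `invoke` could wrap the `sagemaker-runtime` client's `invoke_endpoint` call, passing the model artifact name as `TargetModel`. Calls that land in the slow list are only a hint that a load happened, since network jitter can also push a call over the cutoff.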

So my guess is that SageMaker MMS is unloading models from memory and loading new models into memory, essentially swapping models in and out constantly. I added logging to my MMS model handler, and it shows that the handler keeps being initialized for different models over and over when this happens.
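A minimal sketch of this kind of handler logging is below. It is not the exact handler used here: the class name `LoggingModelHandler` and the log format are assumptions for illustration, and the inference step is replaced by a placeholder. It relies only on MMS passing a `context` object whose `system_properties` include `model_dir`.

```python
import logging
import os
import time

logger = logging.getLogger(__name__)


class LoggingModelHandler:
    """Hypothetical MMS-style handler that logs each time a model is loaded."""

    def __init__(self):
        self.initialized = False
        self.model_dir = None
        self.loaded_at = None

    def initialize(self, context):
        # MMS supplies the model directory through the context's system properties.
        self.model_dir = context.system_properties.get("model_dir")
        self.loaded_at = time.time()
        logger.info(
            "model load: dir=%s pid=%d loaded_at=%.3f",
            self.model_dir,
            os.getpid(),
            self.loaded_at,
        )
        self.initialized = True

    def handle(self, data, context):
        # Load lazily on the first request, then reuse the loaded state.
        if not self.initialized:
            self.initialize(context)
        # Placeholder for real inference: echo each input item as a string.
        return [str(item) for item in data]
```

Each "model load" line in the container logs then marks a point where a model was brought into memory. This only shows loads; it does not show when or why a model was unloaded.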

Also, the CloudWatch DiskUtilization metric keeps rising as more models are invoked, which I take to mean that MMS downloads the models from S3 to the container disk. The other metrics (memory and loaded models), on the other hand, plateau after 8-10 loaded models, with only minor changes up and down. This further supports the theory that it constantly swaps models between container disk and memory.

I cannot find a way to see when MMS actually unloads a model from memory or loads a different one. I also cannot see what threshold it uses to unload models: the CloudWatch MemoryUtilization metric from the SageMaker instance never goes above 45, which I guess means that at most 45% of memory is used. This seems like a very low threshold, so I would expect it to be configurable, but I have not found a way to configure it.

Question 1: How can I observe when MMS is unloading models from memory, and loading new ones?

Question 2: How can I control the memory thresholds (or whatever MMS uses) that define when to unload the models?
