ozil

Reputation: 679

How does a scaling policy work with SageMaker endpoints?

Based on the docs provided here, https://github.com/aws/amazon-sagemaker-examples/blob/main/async-inference/Async-Inference-Walkthrough.ipynb , I'm defining an autoscaling policy for my SageMaker endpoint (sample code below). I have specified a capacity range of 0 to 3 in the scalable target config. My understanding is that this will scale up to a maximum capacity of 3 when needed, and otherwise scale down to 0 after a period. From a cost perspective, are there any charges when it is scaled down to 0? Also, how does it handle the case where it has scaled to maximum capacity and there are still more requests in the queue?

client = boto3.client("application-autoscaling")
resource_id = ("endpoint/" + endpoint_name + "/variant/" + "variant1")

# Configure Autoscaling on asynchronous endpoint down to zero instances
response = client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=3,
)

response = client.put_scaling_policy(
    PolicyName="Invocations-ScalingPolicy",
    ServiceNamespace="sagemaker", 
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # only Instance Count
    PolicyType="TargetTrackingScaling",  
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # Target value for the metric below (ApproximateBacklogSizePerInstance)
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,
        "ScaleOutCooldown": 300,
    },
)

Upvotes: 1

Views: 2747

Answers (1)

Raghu Ramesha

Reputation: 494

The autoscaling policy (and scaling down to zero) is based on the TargetTrackingScalingPolicyConfiguration you define. In this particular scenario, we use ApproximateBacklogSizePerInstance to determine the scaling. For instance, if the number of requests in the managed queue goes beyond a certain threshold, we automatically step-scale the instance count, which works very similarly to Auto Scaling in EC2. I'd recommend you check the HuggingFace-Async and CV-Async examples, where we go over the different metrics you can use for autoscaling and the respective CloudWatch metric graphs.
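To make the step-scaling idea above concrete: a target-tracking policy on a per-instance metric cannot fire while the instance count is 0, so the AWS async-inference walkthroughs pair it with a StepScaling policy driven by the HasBacklogWithoutCapacity CloudWatch metric. The sketch below builds the request arguments for such a policy; the endpoint name and variant name are placeholders, and the actual put_scaling_policy call (and the CloudWatch alarm that triggers the policy) are shown in comments since they require live AWS credentials.

```python
# Sketch: scale-out-from-zero companion policy for an async endpoint.
# Assumptions: endpoint_name and "variant1" are placeholders for your
# own endpoint; cooldown and step values are illustrative.
endpoint_name = "my-async-endpoint"  # hypothetical endpoint name

step_policy_kwargs = {
    "PolicyName": "HasBacklogWithoutCapacity-ScalingPolicy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": f"endpoint/{endpoint_name}/variant/variant1",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "StepScaling",
    "StepScalingPolicyConfiguration": {
        "AdjustmentType": "ChangeInCapacity",  # add instances, not a percentage
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        # A single step: any breach of the alarm threshold adds one instance,
        # which is enough to bring the endpoint back from zero.
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1},
        ],
    },
}

# With credentials configured, you would register the policy and wire a
# CloudWatch alarm on HasBacklogWithoutCapacity to its ARN:
#   client = boto3.client("application-autoscaling")
#   policy_arn = client.put_scaling_policy(**step_policy_kwargs)["PolicyARN"]
#   boto3.client("cloudwatch").put_metric_alarm(
#       AlarmName="HasBacklogWithoutCapacity-Alarm",
#       MetricName="HasBacklogWithoutCapacity",
#       Namespace="AWS/SageMaker",
#       Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
#       Statistic="Average", Period=60, EvaluationPeriods=1, Threshold=1,
#       ComparisonOperator="GreaterThanOrEqualToThreshold",
#       AlarmActions=[policy_arn],
#   )
```

The target-tracking policy in the question then handles routine scaling between 1 and MaxCapacity, while this step policy covers the cold-start case where the backlog grows with zero instances running.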

Thanks, Raghu

Note: I work for AWS but my opinions are my own

Upvotes: 1
