ozil

Reputation: 679

How does a scaling policy work with SageMaker endpoints?

Based on the docs provided here, https://github.com/aws/amazon-sagemaker-examples/blob/main/async-inference/Async-Inference-Walkthrough.ipynb , I'm defining an autoscaling policy for my SageMaker endpoint (sample code below). I have specified a capacity range of 0 to 3 in the scalable target config. My understanding is that this will scale up to a maximum capacity of 3 when needed, and otherwise scale down to 0 after a period. From a cost perspective, are there any charges when it is scaled down to 0? Also, how does it handle the case where it has scaled to maximum capacity and there are still more requests in the queue?

client = boto3.client("application-autoscaling")
resource_id = ("endpoint/" + endpoint_name + "/variant/" + "variant1")

# Configure Autoscaling on asynchronous endpoint down to zero instances
response = client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=0,
    MaxCapacity=3,
)

response = client.put_scaling_policy(
    PolicyName="Invocations-ScalingPolicy",
    ServiceNamespace="sagemaker", 
    ResourceId=resource_id,  # Endpoint name
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",  # only Instance Count
    PolicyType="TargetTrackingScaling",  
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # Target value for the metric below (ApproximateBacklogSizePerInstance)
        "CustomizedMetricSpecification": {
            "MetricName": "ApproximateBacklogSizePerInstance",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [{"Name": "EndpointName", "Value": endpoint_name}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 600,
        "ScaleOutCooldown": 300,
    },
)

Upvotes: 1

Views: 2747

Answers (1)

Raghu Ramesha

Reputation: 494

The autoscaling policy (and scaling down to zero) is based on the TargetTrackingScalingPolicyConfiguration you define. In this particular scenario, we use ApproximateBacklogSizePerInstance to determine the scaling. For instance, if the number of requests in the managed queue goes beyond a certain threshold, we automatically step-scale the instance count, which works very similarly to Auto Scaling in EC2. I'd recommend you check the HuggingFace-Async and CV-Async examples, where we go over the different metrics you can use for autoscaling and the respective CloudWatch metric graphs.
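To make the step-scaling idea above concrete: a target-tracking policy on a per-instance metric cannot fire while the instance count is 0, so the AWS async-inference walkthroughs pair it with a StepScaling policy driven by the HasBacklogWithoutCapacity CloudWatch metric. The sketch below builds the request arguments for such a policy; the endpoint name and variant name are placeholders, and the actual put_scaling_policy call (and the CloudWatch alarm that triggers the policy) are shown in comments since they require live AWS credentials.

```python
# Sketch: scale-out-from-zero companion policy for an async endpoint.
# Assumptions: endpoint_name and "variant1" are placeholders for your
# own endpoint; cooldown and step values are illustrative.
endpoint_name = "my-async-endpoint"  # hypothetical endpoint name

step_policy_kwargs = {
    "PolicyName": "HasBacklogWithoutCapacity-ScalingPolicy",
    "ServiceNamespace": "sagemaker",
    "ResourceId": f"endpoint/{endpoint_name}/variant/variant1",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "StepScaling",
    "StepScalingPolicyConfiguration": {
        "AdjustmentType": "ChangeInCapacity",  # add instances, not a percentage
        "MetricAggregationType": "Average",
        "Cooldown": 300,
        # A single step: any breach of the alarm threshold adds one instance,
        # which is enough to bring the endpoint back from zero.
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "ScalingAdjustment": 1},
        ],
    },
}

# With credentials configured, you would register the policy and wire a
# CloudWatch alarm on HasBacklogWithoutCapacity to its ARN:
#   client = boto3.client("application-autoscaling")
#   policy_arn = client.put_scaling_policy(**step_policy_kwargs)["PolicyARN"]
#   boto3.client("cloudwatch").put_metric_alarm(
#       AlarmName="HasBacklogWithoutCapacity-Alarm",
#       MetricName="HasBacklogWithoutCapacity",
#       Namespace="AWS/SageMaker",
#       Dimensions=[{"Name": "EndpointName", "Value": endpoint_name}],
#       Statistic="Average", Period=60, EvaluationPeriods=1, Threshold=1,
#       ComparisonOperator="GreaterThanOrEqualToThreshold",
#       AlarmActions=[policy_arn],
#   )
```

The target-tracking policy in the question then handles routine scaling between 1 and MaxCapacity, while this step policy covers the cold-start case where the backlog grows with zero instances running.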

Thanks, Raghu

Note: I work for AWS but my opinions are my own

Upvotes: 1
