Kiryl A.

Reputation: 164

Why is there no specific directory that SageMaker was supposed to create automatically?

I'm trying to deploy my model (container) on AWS SageMaker. I've pushed the container to AWS ECR. Then I use an AWS Lambda that basically runs create_training_job() via the boto3 SageMaker client. It runs the container in train mode and uploads the generated artifact to S3, like this:

import json

import boto3

sm = boto3.client('sagemaker')

sm.create_training_job(
        TrainingJobName=full_job_name,
        HyperParameters={
            'general': json.dumps(
                {
                    'environment': ENVIRONMENT,
                    'region': REGION,
                    'version': date_suffix,
                    'hyperparameter_tuning': training_params.get('hyperparameter_tuning', False),
                    'basket_analysis': training_params.get('basket_analysis', True),
                    'init_inventory_cache': training_params.get('init_inventory_cache', True),
                }
            ),
            'aws_profile': '***-dev',
            'db_config': json.dumps(database_mapping),
            'model_server_params': json.dumps(training_params.get('model_server_params', {}))
        },

        AlgorithmSpecification={
            'TrainingImage': training_image,
            'TrainingInputMode': 'File',
        },
        RoleArn=ROLE_ARN,
        OutputDataConfig={
            'S3OutputPath': S3_OUTPUT_PATH
        },
        ResourceConfig={
            'InstanceType': INSTANCE_TYPE,
            'InstanceCount': 1,
            'VolumeSizeInGB': 20,
        },
        # VpcConfig={
        #     'SecurityGroupIds': SECURITY_GROUPS.split(','),
        #     'Subnets': SUBNETS.split(',')
        # },
        StoppingCondition={
            'MaxRuntimeInSeconds': int(MAX_RUNTIME_SEC),
            #        'MaxWaitTimeInSeconds': 1800
        },
        Tags=[ ],
        EnableNetworkIsolation=False,
        EnableInterContainerTrafficEncryption=False,
        EnableManagedSpotTraining=False,
    )

I have a logger inside the container, and it reports that /opt/ml/input/config/hyperparameters.json exists during training. It has been added by SageMaker. Fine.
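For reference, that file can be read inside the training container roughly like this (a minimal sketch: the path is SageMaker's standard mount location, and the key names just mirror the HyperParameters above):

    import json

    # Standard location where SageMaker mounts the hyperparameters of a training job.
    HYPERPARAMS_PATH = '/opt/ml/input/config/hyperparameters.json'

    with open(HYPERPARAMS_PATH) as f:
        hyperparams = json.load(f)

    # SageMaker passes every hyperparameter value as a string, so nested
    # structures (like 'general' above) need to be decoded again.
    general = json.loads(hyperparams['general'])
    db_config = json.loads(hyperparams['db_config'])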

But then, when I try to run the same container in serve mode (so basically to deploy it), I find that /opt/ml/input/config/hyperparameters.json no longer exists. I deploy it this way:

    sm.create_model(
        ModelName=model_name,
        PrimaryContainer={
            'Image': training_image,
            'ModelDataUrl': model_artifact,
            'Environment': {
                'version': version
            }
        },
        ExecutionRoleArn=role_arn,
        Tags=[ ],
        # VpcConfig = {
        #     'SecurityGroupIds': os.environ['security_groups'].split(','),
        #     'Subnets': os.environ['subnets'].split(',')
        # }
    )

    sm.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[
            {
                'VariantName': variant_name,
                'ModelName': model_name,
                'InitialInstanceCount': instance_count,
                'InstanceType': instance_type,
                'InitialVariantWeight': 1
            },
        ],
        Tags=[ ],
    )

    existing_endpoints = sm.list_endpoints(NameContains=endpoint_name)

    # Application Auto Scaling client, used below to manage the endpoint's scaling target
    aas = boto3.client('application-autoscaling')

    scaling_resource_id = f'endpoint/{endpoint_name}/variant/{variant_name}'

    if not existing_endpoints['Endpoints']:
        sm.create_endpoint(
            EndpointName=endpoint_name,
            EndpointConfigName=config_name
        )
    else:
        if aas.describe_scalable_targets(
                ServiceNamespace='sagemaker',
                ResourceIds=[scaling_resource_id],
                ScalableDimension='sagemaker:variant:DesiredInstanceCount')['ScalableTargets']:
            aas.deregister_scalable_target(
                ServiceNamespace='sagemaker',
                ResourceId=scaling_resource_id,
                ScalableDimension='sagemaker:variant:DesiredInstanceCount'
            )

        sm.update_endpoint(
            EndpointName=endpoint_name,
            EndpointConfigName=config_name
        )

This matters to me because the hyperparameters file seemed like a convenient way to pass parameters into the container from the outside (e.g. from the management console).

I assumed that this file/directory would still exist after training. Any ideas?

Upvotes: 0

Views: 915

Answers (1)

bobbruno

Reputation: 94

tl;dr: two options:

  • Copy the hyperparameters.json file to /opt/ml/model in the training logic and it will be packed with the model artifacts;
  • Pass whatever parameters you want through the PrimaryContainer parameter's Environment property.

Long version:

That file, /opt/ml/input/config/hyperparameters.json (in fact the whole /opt/ml/input folder), is mounted on the training container when it is created. It is provided by SageMaker, based on the information you supply, for training purposes only. SageMaker does not change your container in any way, and it does not preserve this or any other configuration file it passes to the training job once training is done. If you want to pass parameters to the inference endpoint, this is not the way.

You could copy the hyperparameters.json file to the /opt/ml/model folder, and it would be packed with the model in the model.tar.gz tarball. Your inference code could then use it - but that's not the prescribed way to pass parameters to an endpoint, and it could cause problems with your framework.
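A minimal sketch of that option, assuming the standard SageMaker paths inside the training container (the copy itself is the only real step; everything else is illustrative):

    import shutil

    # Standard SageMaker paths inside the training container.
    HYPERPARAMS_PATH = '/opt/ml/input/config/hyperparameters.json'
    MODEL_DIR = '/opt/ml/model'

    # Everything written to /opt/ml/model is tarred into model.tar.gz and
    # uploaded to S3OutputPath when the training job finishes, so the copied
    # file travels with the model artifact and is available at serving time.
    shutil.copy(HYPERPARAMS_PATH, MODEL_DIR)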

The generally prescribed way to pass parameters to SageMaker endpoints is through the environment. If you check the boto3 docs for create_model, you'll see that there's an Environment key within the PrimaryContainer parameter (and within each entry of the Containers parameter as well). In fact, your code above already uses it to pass a version parameter. You should use that to pass any parameters to your model and, from there, to the endpoint based on it.
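On the serving side, those Environment keys show up as plain environment variables, so the inference code can read them with os.environ (a sketch; 'version' is just the key from your own create_model call):

    import os

    # Values set in PrimaryContainer['Environment'] in create_model are exposed
    # to the serving container as environment variables.
    version = os.environ.get('version', 'unknown')

Any other parameter can be passed the same way; if a value is structured, json.dumps it on the create_model side and json.loads it in the serving code.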

Upvotes: 1
