user12143596

ValidationException error when calling the CreateTrainingJob operation: You can’t override the metric definitions for Amazon SageMaker algorithms

I'm trying to run a Lambda function that creates a SageMaker training job using the same parameters as a previous training job. Here's my Lambda function:

import datetime
import os

import boto3

def lambda_handler(event, context):
    training_job_name = os.environ['training_job_name']
    sm = boto3.client('sagemaker')
    job = sm.describe_training_job(TrainingJobName=training_job_name)

    training_job_prefix = 'new-randomcutforest-'
    training_job_name = training_job_prefix+str(datetime.datetime.today()).replace(' ', '-').replace(':', '-').rsplit('.')[0]

    print("Starting training job %s" % training_job_name)

    resp = sm.create_training_job(
            TrainingJobName=training_job_name, 
            AlgorithmSpecification=job['AlgorithmSpecification'], 
            RoleArn=job['RoleArn'],
            InputDataConfig=job['InputDataConfig'], 
            OutputDataConfig=job['OutputDataConfig'],
            ResourceConfig=job['ResourceConfig'], 
            StoppingCondition=job['StoppingCondition'], 
            VpcConfig=job['VpcConfig'],
            HyperParameters=job['HyperParameters'] if 'HyperParameters' in job else {},
            Tags=job['Tags'] if 'Tags' in job else [])
[...]

And I keep getting the following error message:

An error occurred (ValidationException) when calling the CreateTrainingJob operation: You can’t override the metric definitions for Amazon SageMaker algorithms. Please retry the request without specifying metric definitions.: ClientError
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 96, in lambda_handler
    StoppingCondition=job['StoppingCondition']

I get the same error for HyperParameters and Tags.

I tried to remove these parameters, but they are required, so that's not a solution:

Parameter validation failed:
Missing required parameter in input: "StoppingCondition": ParamValidationError

I tried to hard-code these variables, but it led to the same error.

The exact same function used to work, but only for a few training jobs (around 5) before it started giving this error message. Now it has stopped working completely, and the same error message comes up. Any idea why?

Upvotes: 0

Views: 2046

Answers (2)

user13160715

Reputation: 11

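Before calling sm.create_training_job, remove the MetricDefinitions key from the 'AlgorithmSpecification' dictionary. describe_training_job returns the previous job's metric definitions inside AlgorithmSpecification, and CreateTrainingJob rejects them for built-in SageMaker algorithms, which is what the error message is telling you. Popping the key avoids that: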

job['AlgorithmSpecification'].pop('MetricDefinitions', None)
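To put that pop in context, here is a minimal sketch of building the whole request from the describe_training_job response before passing it to create_training_job. The helper name build_training_request, the two key tuples, and the now parameter are hypothetical, introduced only for illustration. It also assumes that HyperParameters and VpcConfig may be missing from the response, so they are copied only when present.

```python
import copy
import datetime

# Keys the question copies from the previous job unconditionally.
ALWAYS_COPIED = ("RoleArn", "InputDataConfig", "OutputDataConfig",
                 "ResourceConfig", "StoppingCondition")
# Keys copied only if the previous job has them (assumption: they may be absent).
COPIED_IF_PRESENT = ("HyperParameters", "VpcConfig")


def build_training_request(job, prefix="new-randomcutforest-", now=None):
    """Build create_training_job keyword arguments from a describe_training_job response."""
    now = now or datetime.datetime.today()

    # Work on a copy so the original response is left untouched.
    algo = copy.deepcopy(job["AlgorithmSpecification"])
    # Built-in algorithms reject metric definitions, so drop them.
    algo.pop("MetricDefinitions", None)

    request = {
        # Same name format as in the question: prefix + YYYY-MM-DD-HH-MM-SS.
        "TrainingJobName": prefix + now.strftime("%Y-%m-%d-%H-%M-%S"),
        "AlgorithmSpecification": algo,
    }
    for key in ALWAYS_COPIED:
        request[key] = job[key]
    for key in COPIED_IF_PRESENT:
        if key in job:
            request[key] = job[key]
    return request
```

In the Lambda handler this would be used as resp = sm.create_training_job(**build_training_request(job)), with sm and job defined as in the question.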

Upvotes: 1

ishaaq

Reputation: 6769

It's hard to tell exactly what's going wrong here and why your previous job's hyperparameters didn't work. Perhaps, instead of just passing them along to the new job, you could print them out so you can inspect them?
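One possible way to do that inspection is sketched below. The helper name format_job and the sample dictionary are hypothetical; in the Lambda you would pass the actual describe_training_job response instead. The response contains datetime values, which json cannot serialize directly, so the sketch falls back to str for them.

```python
import datetime
import json


def format_job(job):
    """Return a describe_training_job-style dict as indented JSON text."""
    # default=str converts values json can't handle (e.g. datetime) to strings.
    return json.dumps(job, indent=2, sort_keys=True, default=str)


# Made-up sample standing in for the real response.
sample = {
    "HyperParameters": {"num_trees": "50"},
    "CreationTime": datetime.datetime(2020, 1, 2, 3, 4, 5),
}
print(format_job(sample))
```

Inside a Lambda function the printed text ends up in the function's CloudWatch Logs, where you can check exactly which keys are being passed to the new job.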

Going by this line...

    training_job_prefix = 'new-randomcutforest-'

... I am going to hazard a guess and assume you are trying to run RCF (Random Cut Forest). The hyperparameters that algorithm requires are documented here: https://docs.aws.amazon.com/sagemaker/latest/dg/rcf_hyperparameters.html

Upvotes: 0
