prashant0598

Reputation: 233

Creating a Training Job using sagemaker estimator gives me "error: unrecognized arguments: train"

I am using a Docker image that contains all the required files. I push it to AWS ECR so that I can pass the image to the estimator. I have set train.py as the entrypoint in the Dockerfile:

ENTRYPOINT ["python3", "-m","train"]

This works as expected locally with docker run -it image, but when I run the training job I get this error:

Training - Training image download completed. Training in progress...usage: train.py [-h] [--epochs EPOCHS] [--learning_rate LEARNING_RATE]
                [--max_sequence_length MAX_SEQUENCE_LENGTH]
                [--train_batch_size TRAIN_BATCH_SIZE]
                [--valid_batch_size VALID_BATCH_SIZE]
train.py: error: unrecognized arguments: train

The training job is created with the SageMaker estimator:

estimator = sagemaker.estimator.Estimator(image, # docker image
                                          role,
                                          train_instance_count=1, 
                                          train_instance_type='ml.p2.xlarge', 
                                          output_path=output_path, 
                                          hyperparameters=hyperparameters,
                                          sagemaker_session=session
                                         )

The main block of train.py:

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--epochs', type=int, default=2)
    parser.add_argument('--learning_rate', type=float, default=2e-5)
    parser.add_argument('--max_sequence_length', type=int, default=512)
    parser.add_argument('--train_batch_size', type=int, default=12)
    parser.add_argument('--valid_batch_size', type=int, default=8)
    # SageMaker environment variables.
    #parser.add_argument('--hosts', type=str, default=os.environ['SM_HOSTS'])
    #parser.add_argument('--current_host', type=str, default=os.environ['SM_CURRENT_HOST'])
    # Parse command-line args and run main.
    args = parser.parse_args()
    # Get SageMaker host information from runtime environment variables
    #sm_hosts = json.loads(args.hosts)
    #sm_current_host = args.current_host
    train(args)

From the SageMaker docs I found that a training job runs the image as docker run image train, and when I tried the same command locally I got the same error.

Upvotes: 3

Views: 1696

Answers (3)

rok

Reputation: 2765

You don't need to define the ENTRYPOINT at all. What I do is simply have a train (with no file extension) file with my training code. Be sure to make it executable and put it in /opt/ml/code. See the complete code here.
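For illustration only (this is not the linked code), here is a minimal sketch of what such an extensionless train file might contain. It assumes the container does not use a helper toolkit, in which case SageMaker writes the estimator's hyperparameters to /opt/ml/input/config/hyperparameters.json with every value stored as a string. The helper name load_hyperparameters and the DEFAULTS table are hypothetical. The default values are copied from the question.

```python
#!/usr/bin/env python3
import json
from pathlib import Path

# Location where SageMaker places the estimator's hyperparameters in the container.
HYPERPARAMETERS_PATH = Path("/opt/ml/input/config/hyperparameters.json")

# Hypothetical defaults, mirroring the argparse defaults from the question.
DEFAULTS = {
    "epochs": 2,
    "learning_rate": 2e-5,
    "max_sequence_length": 512,
    "train_batch_size": 12,
    "valid_batch_size": 8,
}


def load_hyperparameters(path=HYPERPARAMETERS_PATH):
    """Read the hyperparameters file and cast each value to its default's type."""
    path = Path(path)
    raw = json.loads(path.read_text()) if path.exists() else {}
    # Values arrive as strings, so convert them using the type of each default.
    return {name: type(default)(raw.get(name, default))
            for name, default in DEFAULTS.items()}


if __name__ == "__main__":
    # Falls back to the defaults when the file is absent (e.g. a local run).
    print(load_hyperparameters())
```

Because the file is started directly as the train command, it needs the shebang line and the executable bit (chmod +x train), and nothing is parsed from the command line.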

Upvotes: 2

NpnSaddy

Reputation: 325

Assuming train.py is at the root level or in the working directory of your Docker image, the following should solve the issue:

ENTRYPOINT ["python3", "train.py"]

More info: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/scikit_bring_your_own/scikit_bring_your_own.ipynb
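With an exec-form ENTRYPOINT, Docker appends the train argument from docker run image train to the command, so the script still receives it on its command line. If train.py keeps reporting the unrecognized argument, one possible workaround is to let the parser accept it. The sketch below declares an optional positional argument. The names mode, build_parser and parse_cli are hypothetical, and the remaining options are copied from the question.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Optional positional that absorbs the "train" word SageMaker passes;
    # it may be omitted, so a plain local run keeps working.
    parser.add_argument('mode', nargs='?', default='train')
    parser.add_argument('--epochs', type=int, default=2)
    parser.add_argument('--learning_rate', type=float, default=2e-5)
    parser.add_argument('--max_sequence_length', type=int, default=512)
    parser.add_argument('--train_batch_size', type=int, default=12)
    parser.add_argument('--valid_batch_size', type=int, default=8)
    return parser


def parse_cli(argv=None):
    """Parse argv (defaults to sys.argv[1:]) with the tolerant parser."""
    return build_parser().parse_args(argv)
```

In train.py, args = parse_cli() would then replace the parser.parse_args() call. An alternative with the original parser is parser.parse_known_args(), which returns the parsed namespace together with a list of leftover arguments instead of exiting with an error.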

Upvotes: 4
