Reputation: 2043
I am training a classifier using Scikit-learn with the SageMaker Python SDK.
The overall process involves three sequential phases: (1) hyperparameter tuning, (2) fitting the model with the best hyperparameters found, and (3) calibrating the model.
The reason I need to split the process is to save the un-calibrated model created at step 2.
For each of these steps I prepare a training script as explained in: https://sagemaker.readthedocs.io/en/stable/using_sklearn.html#prepare-a-scikit-learn-training-script
The three scripts are very similar, and to avoid code redundancy I would like to use a single script with additional logic inside to handle the three situations. More precisely, I would like to pass additional custom parameters to the .fit
methods of the sagemaker.tuner.HyperparameterTuner
and sagemaker.sklearn.estimator.SKLearn
objects so that the script can switch its logic depending on the usage (phase 1, 2 or 3).
I already tried hacking the SM_CHANNEL_XXX environment variables:
parser.add_argument('--myparam', type=str, default=os.environ.get('SM_CHANNEL_MYPARAM'))
while invoking .fit(inputs={'train': ..., 'test': ..., 'myparam': myvalue}),
but it expects a valid S3 URI.
Any idea on how to pass extra custom parameters to the training scripts?
Upvotes: 2
Views: 5023
Reputation: 127
According to the SageMaker documentation seen here, you can access the hyperparameters in your training script as command-line arguments, e.g.:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--epochs', type=int, default=10)
parser.add_argument('--batch_size', type=int, default=100)
parser.add_argument('--learning_rate', type=float, default=0.1)
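Since each hyperparameter is forwarded to your script as a command-line argument, you can also pass a custom string flag from the notebook and branch on it, which is what the question asks for. A minimal sketch of the script side, assuming a made-up --phase hyperparameter (not something SageMaker defines):

import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # data channels that SageMaker downloads from S3
    parser.add_argument('--train', type=str, default=os.environ.get('SM_CHANNEL_TRAIN'))
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    # custom flag set via the estimator's hyperparameters dict, e.g. hyperparameters={'phase': 'fit'}
    parser.add_argument('--phase', type=str, default='tune')
    args, _ = parser.parse_known_args()

    if args.phase == 'tune':
        pass  # phase 1: hyperparameter search
    elif args.phase == 'fit':
        pass  # phase 2: fit and save the un-calibrated model
    elif args.phase == 'calibrate':
        pass  # phase 3: load the un-calibrated model and calibrate it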
Upvotes: 1
Reputation: 1354
You can pass hyperparameters not in the fit method but one step earlier, when you create the estimator. The example in the docs would be:
sklearn_estimator = SKLearn('sklearn-train.py',
                            train_instance_type='ml.m4.xlarge',
                            framework_version='0.20.0',
                            hyperparameters={'epochs': 20,
                                             'batch-size': 64,
                                             'learning-rate': 0.1})

sklearn_estimator.fit({'train': 's3://my-data-bucket/path/to/my/training/data',
                       'test': 's3://my-data-bucket/path/to/my/test/data'})
This is how you bring your parameters (from your notebook) into the training script, where you access them via parser.add_argument. If you have only one script, you can handle your logic inside that script. But this does not add custom parameters to the .fit method of sagemaker.tuner.HyperparameterTuner.
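For the HyperparameterTuner case, my understanding (an assumption worth checking against the SDK docs) is that any hyperparameter set on the estimator that is not listed in hyperparameter_ranges is carried over to every tuning job as a static hyperparameter, so a custom flag set on the estimator should still reach the script. A sketch, where the 'phase' name, the metric name and the regex are only illustrative:

from sagemaker.sklearn.estimator import SKLearn
from sagemaker.tuner import ContinuousParameter, HyperparameterTuner

sklearn_estimator = SKLearn('sklearn-train.py',
                            train_instance_type='ml.m4.xlarge',
                            framework_version='0.20.0',
                            hyperparameters={'phase': 'tune'})  # custom static flag

tuner = HyperparameterTuner(sklearn_estimator,
                            objective_metric_name='validation:accuracy',
                            metric_definitions=[{'Name': 'validation:accuracy',
                                                 'Regex': 'validation accuracy: ([0-9.]+)'}],
                            hyperparameter_ranges={'learning-rate': ContinuousParameter(0.01, 0.1)},
                            max_jobs=4,
                            max_parallel_jobs=2)

# the script must print a line matching the regex above and accept --learning-rate
tuner.fit({'train': 's3://my-data-bucket/path/to/my/training/data',
           'test': 's3://my-data-bucket/path/to/my/test/data'})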
I use the following sequence to optimize the parameters in the script and then apply the best parameters (also using only one training script). Maybe you can apply this to your case. You should be able to save intermediate models with joblib.dump in your script (a short sketch of that saving step follows the grid search code below):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# tfidf (a TfidfVectorizer), stop (a stop-word list) and X_train/y_train
# are assumed to be defined earlier in the training script
param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]}]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

gs_lr_tfidf.fit(X_train, y_train)
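And a minimal sketch of the saving step mentioned above, assuming the script runs as a SageMaker training job (SageMaker sets SM_MODEL_DIR and uploads its contents to S3 when the job finishes; the file name model.joblib is just an example):

import os
import joblib

# persist the best pipeline so a later phase (e.g. calibration) can reload it
model_dir = os.environ.get('SM_MODEL_DIR', '.')
joblib.dump(gs_lr_tfidf.best_estimator_, os.path.join(model_dir, 'model.joblib'))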
Upvotes: 1