This is the entire error message I get after executing terraform apply in the terraform folder of the source code in my GitHub repo (inspired by this tutorial and its related GitHub repo):
aws_sagemaker_notebook_instance.notebook_instance: Creating...
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [10s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [20s elapsed]
...
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [15m21s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [15m31s elapsed]
╷
│ Error: error waiting for sagemaker notebook instance (aws-sm-notebook-instance) to create: unexpected state 'Failed', wanted target 'InService'. last error: %!s(<nil>)
│
│ with aws_sagemaker_notebook_instance.notebook_instance,
│ on notebook_instance.tf line 2, in resource "aws_sagemaker_notebook_instance" "notebook_instance":
│ 2: resource "aws_sagemaker_notebook_instance" "notebook_instance" {
│
Internet research seemed to provide the solution in this article, which inspired me to increase the allowed IDLE_TIME in the on-start.sh script to IDLE_TIME=1800 (in seconds, i.e. 30 minutes). This should have been sufficient for a deployment time of around 15 minutes, yet it threw the same error again.
Next, I found this post on Stack Overflow suggesting to run terraform refresh, which causes Terraform to refresh its state file against what actually exists with the cloud provider.
Unfortunately, running terraform apply right after refreshing didn't resolve the issue either.
I'm wondering why the aforementioned IDLE_TIME=1800 setting does not have any effect. It should be more than sufficient for a 15-minute apply time.
EDIT: adding code specifics for enhanced understanding
1. Creating the SageMaker notebook instance
resource "aws_sagemaker_notebook_instance" "notebook_instance" {
  name                    = "aws-sm-notebook-instance"
  role_arn                = aws_iam_role.notebook_iam_role.arn
  instance_type           = "ml.t2.medium"
  lifecycle_config_name   = aws_sagemaker_notebook_instance_lifecycle_configuration.notebook_config.name
  default_code_repository = aws_sagemaker_code_repository.git_repo.code_repository_name
}
2. Defining the SageMaker notebook lifecycle configuration
resource "aws_sagemaker_notebook_instance_lifecycle_configuration" "notebook_config" {
  name      = "dev-platform-al-sm-lifecycle-config"
  on_create = filebase64("../scripts/on-create.sh")
  on_start  = filebase64("../scripts/on-start.sh")
}
3. Defining the Git repo to instantiate on the SageMaker notebook instance
resource "aws_sagemaker_code_repository" "git_repo" {
  code_repository_name = "aws-sm-notebook-instance-repo"
  git_config {
    repository_url = "https://github.com/AndreasLuckert/aws-sm-notebook-instance.git"
  }
}
Contents of on-start.sh (including the IDLE_TIME parameter)
Note that this script fetches the scripts/autostop.py script and schedules it via cron; you can find it here in the associated public repo containing the source code.
#!/bin/bash
set -e
## IDLE AUTOSTOP STEPS
## ----------------------------------------------------------------
## Setting the timeout (in seconds) for how long the SageMaker notebook can run idly before being auto-stopped
# -> e.g. 1800 s = 30 min since first deployment can take between 15 and 20 minutes which could then fail like so:
# "Error: error waiting for sagemaker notebook instance (aws-sm-notebook-instance) to create: unexpected state 'Failed', wanted target 'InService'. last error: %!s(<nil>)"
# Hint for solution under following link: https://yuyasugano.medium.com/machine-learning-infrastructure-terraforming-sagemaker-part-2-f2460a9a4663
IDLE_TIME=1800
# Getting the autostop.py script from GitHub
echo "Fetching the autostop script..."
wget https://raw.githubusercontent.com/andreasluckert/aws-sm-notebook-instance/main/scripts/autostop.py
# Using crontab to autostop the notebook when idle time is breached
echo "Starting the SageMaker autostop script in cron."
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/bin/python $PWD/autostop.py --time $IDLE_TIME --ignore-connections") | crontab -
## CUSTOM CONDA KERNEL USAGE STEPS
## ----------------------------------------------------------------
# Setting the proper user credentials
sudo -u ec2-user -i <<'EOF'
unset SUDO_UID
# Setting the source for the custom conda kernel
WORKING_DIR=/home/ec2-user/SageMaker/custom-miniconda
source "$WORKING_DIR/miniconda/bin/activate"
# Loading all the custom kernels
for env in "$WORKING_DIR"/miniconda/envs/*; do
    BASENAME=$(basename "$env")
    source activate "$BASENAME"
    python -m ipykernel install --user --name "$BASENAME" --display-name "Custom ($BASENAME)"
done
EOF
The solution was to check the CloudWatch log events under CloudWatch -> Log groups -> /aws/sagemaker/NotebookInstances -> aws-sm-notebook-instance/LifecycleConfigOnCreate, where I found the following error message:
/bin/bash: /tmp/OnCreate_2021-09-08-12-24rw5al34g: /bin/bash^M: bad interpreter: No such file or directory
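The same log stream can also be read without the console. The following is a minimal sketch, assuming the AWS CLI is installed and configured with credentials that may read CloudWatch Logs; the function names are hypothetical, while the log group and stream names are the ones shown above.

```shell
#!/bin/sh
# Log group that SageMaker notebook instances write to (taken from the path above).
LOG_GROUP="/aws/sagemaker/NotebookInstances"

# Builds the log stream name "<notebook-name>/<stream-suffix>".
# Hypothetical helper, not part of the original post.
lifecycle_log_stream() {
  printf '%s/%s\n' "$1" "$2"
}

# Prints all messages of a lifecycle log stream.
# Requires the AWS CLI and suitable credentials (assumption).
show_lifecycle_log() {
  aws logs get-log-events \
    --log-group-name "$LOG_GROUP" \
    --log-stream-name "$(lifecycle_log_stream "$1" "$2")" \
    --query 'events[].message' \
    --output text
}
```

For the instance in this question, the call would be show_lifecycle_log aws-sm-notebook-instance LifecycleConfigOnCreate.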
A bit of internet research brought me to this solution about newline characters in shell scripts, which differ between Windows and UNIX systems.
As I'm working on Windows, the shell scripts I created in VS Code used DOS-style CRLF line endings. This can be fixed via the button at the bottom right of VS Code, which switches the line endings from CRLF (carriage return + line feed) to LF (line feed), as used on UNIX.
The compute instance used by AWS SageMaker is a Linux system, so it treats the carriage return at the end of the shebang line as part of the interpreter path. That is the ^M shown after /bin/bash in the log, and since no interpreter with that name exists, the script fails to start.
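The line endings can also be checked and converted from a shell (for example Git Bash or WSL on Windows) instead of through the editor. This is a minimal sketch; has_crlf and strip_crlf are hypothetical helper names, not something from the tutorial or the repo.

```shell
#!/bin/sh
# Succeeds if the given file contains at least one carriage return character.
has_crlf() {
  cr="$(printf '\r')"
  grep -q "$cr" "$1"
}

# Removes every carriage return from the file, turning CRLF into LF.
# The result is written back through a temp file so the original
# file keeps its permissions.
strip_crlf() {
  strip_tmp="$(mktemp)"
  tr -d '\r' < "$1" > "$strip_tmp" && cat "$strip_tmp" > "$1"
  rm -f "$strip_tmp"
}
```

Run from the terraform folder, something like for f in ../scripts/*.sh; do has_crlf "$f" && strip_crlf "$f"; done would convert the lifecycle scripts before the next terraform apply.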
With the line endings fixed, terraform apply finally succeeded:
$ terraform apply
...
...
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [7m30s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [7m40s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Creation complete after 7m43s [id=aws-sm-notebook-instance]
Apply complete! Resources: 1 added, 1 changed, 1 destroyed.