Andreas L.

Reputation: 4541

Terraform Error: error waiting for sagemaker notebook instance to create: unexpected state 'Failed', wanted target 'InService'. last error: %!s(<nil>)

The entire error message after executing terraform apply within the terraform-folder of this source code in my GitHub-repo (inspired by this tutorial and its related GitHub-repo):

aws_sagemaker_notebook_instance.notebook_instance: Creating...
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [10s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [20s elapsed]
...
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [15m21s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [15m31s elapsed]
╷
│ Error: error waiting for sagemaker notebook instance (aws-sm-notebook-instance) to create: unexpected state 'Failed', wanted target 'InService'. last error: %!s(<nil>)
│
│   with aws_sagemaker_notebook_instance.notebook_instance,
│   on notebook_instance.tf line 2, in resource "aws_sagemaker_notebook_instance" "notebook_instance":
│    2: resource "aws_sagemaker_notebook_instance" "notebook_instance" {
│

Internet research seemed to provide the solution in this article, which inspired me to increase the allowed IDLE_TIME in the on-start.sh script to IDLE_TIME=1800 (in seconds, i.e. 30 minutes). This should have been sufficient for a deployment time of around 15 minutes, yet the same error was thrown again.

Next, I found this post on Stack Overflow suggesting to

run terraform refresh, which will cause Terraform to refresh its state file against what actually exists with the cloud provider.

Unfortunately, running terraform apply right after refreshing didn't resolve the issue either. I'm wondering why the aforementioned IDLE_TIME=1800 - setting does not have any effect. This should be more than sufficient for a 15-minute apply-time.


EDIT: adding code specifics for enhanced understanding

1. Creating the SageMaker notebook instance

resource "aws_sagemaker_notebook_instance" "notebook_instance" {
  name                    = "aws-sm-notebook-instance"
  role_arn                = aws_iam_role.notebook_iam_role.arn
  instance_type           = "ml.t2.medium"
  lifecycle_config_name   = aws_sagemaker_notebook_instance_lifecycle_configuration.notebook_config.name
  default_code_repository = aws_sagemaker_code_repository.git_repo.code_repository_name
}

2. Defining the SageMaker notebook lifecycle configuration

resource "aws_sagemaker_notebook_instance_lifecycle_configuration" "notebook_config" {
  name      = "dev-platform-al-sm-lifecycle-config"
  on_create = filebase64("../scripts/on-create.sh")
  on_start  = filebase64("../scripts/on-start.sh")
}

3. Defining the Git repo to instantiate on the SageMaker notebook instance

resource "aws_sagemaker_code_repository" "git_repo" {
  code_repository_name = "aws-sm-notebook-instance-repo"

  git_config {
    repository_url = "https://github.com/AndreasLuckert/aws-sm-notebook-instance.git"
  }
}

Contents of on-start.sh (including the IDLE_TIME parameter). Note that this script schedules scripts/autostop.py via cron; you can find that script here in the associated public repo containing the source code.

#!/bin/bash

set -e

## IDLE AUTOSTOP STEPS
## ----------------------------------------------------------------

## Setting the timeout (in seconds) for how long the SageMaker notebook can run idly before being auto-stopped
# -> e.g. 1800 s = 30 min since first deployment can take between 15 and 20 minutes which could then fail like so:
# "Error: error waiting for sagemaker notebook instance (aws-sm-notebook-instance) to create: unexpected state 'Failed', wanted target 'InService'. last error: %!s(<nil>)"
# Hint for solution under following link: https://yuyasugano.medium.com/machine-learning-infrastructure-terraforming-sagemaker-part-2-f2460a9a4663
IDLE_TIME=1800

# Getting the autostop.py script from GitHub
echo "Fetching the autostop script..."
wget https://raw.githubusercontent.com/andreasluckert/aws-sm-notebook-instance/main/scripts/autostop.py

# Using crontab to autostop the notebook when idle time is breached
echo "Starting the SageMaker autostop script in cron."
(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/bin/python $PWD/autostop.py --time $IDLE_TIME --ignore-connections") | crontab -



## CUSTOM CONDA KERNEL USAGE STEPS
## ----------------------------------------------------------------

# Setting the proper user credentials
sudo -u ec2-user -i <<'EOF'
unset SUDO_UID

# Setting the source for the custom conda kernel
WORKING_DIR=/home/ec2-user/SageMaker/custom-miniconda
source "$WORKING_DIR/miniconda/bin/activate"

# Loading all the custom kernels
for env in "$WORKING_DIR"/miniconda/envs/*; do
    BASENAME=$(basename "$env")
    source activate "$BASENAME"
    python -m ipykernel install --user --name "$BASENAME" --display-name "Custom ($BASENAME)"
done
EOF

Upvotes: 0

Views: 2120

Answers (1)

Andreas L.

Reputation: 4541

The solution was to check the CloudWatch log events under CloudWatch -> Log groups -> /aws/sagemaker/NotebookInstances -> aws-sm-notebook-instance/LifecycleConfigOnCreate, which contained the following error message:

/bin/bash: /tmp/OnCreate_2021-09-08-12-24rw5al34g: /bin/bash^M: bad interpreter: No such file or directory
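The same information can probably also be pulled from the command line instead of the console. The following is only a sketch, assuming the AWS CLI is installed and configured with credentials and a region; the function names show_failure_reason and show_lifecycle_log are made up for illustration, while the log group and notebook name are the ones from this post.

```shell
#!/bin/bash
# Sketch: read the notebook's failure reason and lifecycle-config logs via the AWS CLI.
# Assumption: the AWS CLI is installed and credentials/region are already configured.

NOTEBOOK_NAME="aws-sm-notebook-instance"
LOG_GROUP="/aws/sagemaker/NotebookInstances"

# Hypothetical helper: print the FailureReason reported by SageMaker.
# $1 = notebook instance name
show_failure_reason() {
    aws sagemaker describe-notebook-instance \
        --notebook-instance-name "$1" \
        --query 'FailureReason' --output text
}

# Hypothetical helper: print the messages of one lifecycle log stream.
# $1 = notebook instance name
# $2 = LifecycleConfigOnCreate or LifecycleConfigOnStart
show_lifecycle_log() {
    aws logs get-log-events \
        --log-group-name "$LOG_GROUP" \
        --log-stream-name "$1/$2" \
        --query 'events[].message' --output text
}
```

After sourcing the snippet, show_lifecycle_log "$NOTEBOOK_NAME" LifecycleConfigOnCreate should print the same "bad interpreter" line as the console, provided the log stream exists.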

A bit of internet research brought me to this solution related to newline characters in shell scripts, which differ between Windows and UNIX systems. As I'm working on Windows, the shell scripts created in VS Code were saved with DOS-style CRLF (carriage return + line feed) line endings. This can be fixed via the button at the bottom right in VS Code, which switches the file from CRLF to the LF (line feed) line endings used by UNIX.

As the compute instance employed by AWS SageMaker is a Linux system, it does not strip the carriage return from DOS-style CRLF line endings. The shebang line is therefore read as /bin/bash followed by a carriage return (displayed as ^M), and since no interpreter with that name exists, the script fails.
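For anyone who prefers to check this outside the editor, the detection and conversion can be sketched in a few lines of bash. This is a minimal illustration, not what I originally ran: the temporary demo file stands in for scripts/on-create.sh, and the helper names has_crlf and to_lf are made up for this example.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical demo file with CRLF line endings, standing in for scripts/on-create.sh
demo_dir=$(mktemp -d)
script="$demo_dir/on-create.sh"
printf '#!/bin/bash\r\necho "hello"\r\n' > "$script"

# Succeeds if at least one line ends with a carriage return
has_crlf() {
    grep -q $'\r$' "$1"
}

# Rewrites the file without carriage returns
# (note: this deletes every CR byte, not only those at line ends)
to_lf() {
    tr -d '\r' < "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

if has_crlf "$script"; then
    echo "CRLF found, converting to LF"
    to_lf "$script"
fi

# The converted script now runs under bash
bash "$script"
```

To keep the problem from coming back on a Windows checkout, a .gitattributes entry such as *.sh text eol=lf is one option, assuming the scripts are tracked in Git.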

So, finally terraform apply worked out well:

$ terraform apply
...
...
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [7m30s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Still creating... [7m40s elapsed]
aws_sagemaker_notebook_instance.notebook_instance: Creation complete after 7m43s [id=aws-sm-notebook-instance]

Apply complete! Resources: 1 added, 1 changed, 1 destroyed.

Upvotes: 0
