Ice Ge

Reputation: 41

Use a SageMaker lifecycle configuration to execute a Jupyter notebook on start

I want to set up an automatic schedule for running my SageMaker notebook.
So far I have found this guide:
https://towardsdatascience.com/automating-aws-sagemaker-notebooks-2dec62bc2c84

I followed the steps to set up the Lambda, the CloudWatch schedule, and the lifecycle configuration.
Across different experiments, the on_start lifecycle configuration could sometimes execute the Jupyter notebook (the notebook just installs a package, loads it, and saves the loading status to an S3 bucket). However, it failed because it couldn't stop the notebook instance afterwards.

Then I added the SageMaker auto-stop permission to my IAM role. Now the notebook instance can be turned on and off, but I no longer see anything uploaded to my S3 bucket. I am wondering whether on_start triggered the auto-stop too early, before the notebook finished its steps?

Below is my current lifecycle configuration script:

set -e

ENVIRONMENT=python3
NOTEBOOK_FILE="/home/ec2-user/SageMaker/Test Notebook.ipynb"
AUTO_STOP_FILE="/home/ec2-user/SageMaker/auto-stop.py"

source /home/ec2-user/anaconda3/bin/activate "$ENVIRONMENT"

nohup jupyter nbconvert --ExecutePreprocessor.timeout=-1 --ExecutePreprocessor.kernel_name=python3 --execute "$NOTEBOOK_FILE" &

echo "Finishing running the jupyter notebook"

source /home/ec2-user/anaconda3/bin/deactivate

# PARAMETERS
IDLE_TIME=60  # 1 minute

echo "Fetching the autostop script"
wget -O autostop.py https://raw.githubusercontent.com/mariokostelac/sagemaker-setup/master/scripts/auto-stop-idle/autostop.py

echo "Starting the SageMaker autostop script in cron"
(crontab -l 2>/dev/null; echo "*/1 * * * * /bin/bash -c '/usr/bin/python3 $DIR/autostop.py --time ${IDLE_TIME} | tee -a /home/ec2-user/SageMaker/auto-stop-idle.log'") | crontab -

Note that I do see the echo "Finishing running the jupyter notebook" in the CloudWatch log. But it's usually the first thing I see in the log, and it shows up immediately - faster than the notebook should take to run.

Also, the notebook is currently only running a dummy task; the real task may take more than an hour.

Any suggestions would help! Thank you for taking the time to read my question.

Upvotes: 4

Views: 5712

Answers (2)

TheLioness

Reputation: 29

What's happening here

nohup jupyter nbconvert --ExecutePreprocessor.timeout=-1 --ExecutePreprocessor.kernel_name=python3 --execute "$NOTEBOOK_FILE" &

is that nohup lets the process keep running even after the shell that launched it exits, and --ExecutePreprocessor.timeout=-1 removes nbconvert's cell execution time limit. The & pushes the notebook execution into the background.

Combined, the script starts executing the notebook in the background while the rest of the script continues running. That's why you see "Finishing running the jupyter notebook" in the logs right after the notebook starts executing in the background.
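
You can reproduce the effect with any long-running command (a trivial illustration, not taken from the original script):

nohup sleep 300 &
echo "this prints immediately; sleep is still running in the background"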

Now the flow of your lifecycle configuration is:

  1. Start executing the notebook in the background
  2. Wait for 60 seconds of inactivity in the notebook
  3. Auto stop the instance

The instance stops after one minute of apparent inactivity, and since the notebook is executing in the background it counts as inactivity. Because your notebook takes more than one minute to run, it never finishes and the file is never written to S3.

To allow more time for your notebook to run, increase the idle time.
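
For example, if the real notebook can take over an hour, something like this (the exact value is a judgment call, not taken from the question) leaves headroom:

IDLE_TIME=5400  # 90 minutes, comfortably longer than the notebook's expected runtime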

Upvotes: -1

Jun L

Reputation: 49

When you say

I do see the echo "Finishing running the jupyter notebook" in the CloudWatch log. But it's usually the first thing I see in the log, and it shows up immediately - faster than the notebook should take to run.

That's expected when you have this line in your script:

nohup jupyter nbconvert --ExecutePreprocessor.timeout=-1 --ExecutePreprocessor.kernel_name=python3 --execute "$NOTEBOOK_FILE" &

nohup keeps the process running even after you log out of the terminal, and & sends the process to the background. As a result, the next command runs immediately after this line.

You are probably using nohup and & here because running the notebook takes longer than the maximum time a lifecycle configuration script is allowed to run (5 minutes; scripts that run longer cause the instance start to fail), which is a good approach in my opinion.
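
If you want to keep that pattern but still run steps strictly after the notebook finishes, one option (a sketch reusing the paths from your script; the helper filename run_notebook.sh is made up) is to move the blocking work into a helper script and background that script as a whole:

#!/bin/bash
# /home/ec2-user/SageMaker/run_notebook.sh (hypothetical helper)
set -e
source /home/ec2-user/anaconda3/bin/activate python3
# nbconvert blocks here until every cell has executed
jupyter nbconvert --ExecutePreprocessor.timeout=-1 --ExecutePreprocessor.kernel_name=python3 --execute "/home/ec2-user/SageMaker/Test Notebook.ipynb"
# anything placed after this line runs only once the notebook is done
echo "Notebook finished" >> /home/ec2-user/SageMaker/run_notebook.log

Then the lifecycle configuration backgrounds just this one call:

nohup bash /home/ec2-user/SageMaker/run_notebook.sh >/home/ec2-user/SageMaker/nohup.out 2>&1 &

The on_start script still returns within the time limit, but "finished" in your logs would now actually mean the notebook has run to completion.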


Now the notebook instance can be turned on and off, but I no longer see anything uploaded to my S3 bucket. I am wondering whether on_start triggered the auto-stop too early, before the notebook finished its steps?

In your script you have:

(crontab -l 2>/dev/null; echo "*/1 * * * * /bin/bash -c '/usr/bin/python3 $DIR/autostop.py --time ${IDLE_TIME} | tee -a /home/ec2-user/SageMaker/auto-stop-idle.log'") | crontab -

This sets up a cron job that runs every minute. The job executes the $DIR/autostop.py script (note that $DIR is never set anywhere in your script, while wget downloads autostop.py into whatever the current working directory happens to be). The autostop.py script then uses $IDLE_TIME to decide whether it should call the stop_notebook_instance API.
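
One way to fix that (a sketch; the choice of directory is arbitrary) is to set $DIR explicitly and download autostop.py to that same location:

DIR=/home/ec2-user/SageMaker
wget -O "$DIR/autostop.py" https://raw.githubusercontent.com/mariokostelac/sagemaker-setup/master/scripts/auto-stop-idle/autostop.py
(crontab -l 2>/dev/null; echo "*/1 * * * * /bin/bash -c '/usr/bin/python3 $DIR/autostop.py --time ${IDLE_TIME} | tee -a /home/ec2-user/SageMaker/auto-stop-idle.log'") | crontab -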

Without digging into the details of what autostop.py does, it's possible that you need to tune the frequency of the cron job, or tune $IDLE_TIME.

Another thought: since you said your real notebook will take more than an hour, maybe you can just have the notebook call the stop_notebook_instance API in its last cell, as sketched below.
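
A minimal sketch of that last cell, using the AWS CLI equivalent of the stop_notebook_instance API (it can be run from the notebook with a ! shell escape; this assumes the instance's IAM role allows sagemaker:StopNotebookInstance):

# SageMaker notebook instances record their own name in this metadata file
NAME=$(python3 -c "import json; print(json.load(open('/opt/ml/metadata/resource-metadata.json'))['ResourceName'])")
aws sagemaker stop-notebook-instance --notebook-instance-name "$NAME"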

Jun

Upvotes: 2
