Jonathan

Reputation: 5

Read and write files in local structure when automating notebooks in Sagemaker Studio Classic

I have written a Jupyter Notebook within AWS Sagemaker Studio Classic that reads from and writes to files in my local folder structure.

Simplified, I have the following folder structure:

root/
├─ conversion_scores/
│  ├─ data_preparation.ipynb
│  ├─ utils/
│  │  ├─ SnowflakeSetup.py
│  ├─ data/
│  │  ├─ testsub/
│  │  │  ├─ to_predict/
│  │  │  │  ├─ output_file.csv

Within data_preparation.ipynb, I do two things that are relevant here:

  1. I load SnowflakeSetup.py:
import os
import sys
sys.path.append(os.path.join(os.getcwd(), "utils"))
import SnowflakeSetup
  2. I write the output_file.csv file to the data folder:
pathname = os.path.join(os.getcwd(), "data", "testsub", "to_predict", f"{end_date.strftime('%Y-%m-%d')}_prediction_run_data.csv")
df.to_csv(pathname, index=False)

When I run this Jupyter notebook manually, it works fine, as I work within root/, and thus e.g. os.path.join(os.getcwd(), "utils") resolves to the correct path /root/chm-conversion-scores/utils.

However, I would like to have this notebook run on a schedule, and the scheduled job apparently does not operate in the same folder. The sys.path.append(os.path.join(os.getcwd(), "utils")) line then appends /opt/ml/input/data/sagemaker_headless_execution/utils, where the script naturally cannot be found, leading to ModuleNotFoundError: No module named 'SnowflakeSetup'

I assume the same problem would occur in the step that writes the .csv file to a local subfolder, although I never got that far.

I tried to hardcode the path as /root/chm-conversion-scores/util, but this does not work either. I also tried using

project_root = os.getenv('PROJECT_ROOT', os.getcwd())
sys.path.append(os.path.join(project_root, "utils"))
import SnowflakeSetup

and

from pathlib import Path
project_root = Path('/root/chm-conversion-scores')
sys.path.append(str(project_root / "utils"))
import SnowflakeSetup

both to no avail.

How can I read from and write to the local folder structure even in an automated job?

Upvotes: 0

Views: 226

Answers (1)

durga_sury

Reputation: 1152

Notebook jobs run on ephemeral instances, similar to running a SageMaker training job, so they do not have access to the local EFS volume or to the rest of the files in your directory; only the .ipynb file itself is available. You can refer to the Additional (file or folder) dependencies section under Custom options here.

You have two options for the input files:

  1. Use the SageMaker Python SDK to create the notebook job; in that case, you can pass in the additional files or folders. They get uploaded to S3 and are available to you when the job runs (see the sketch after this list). See a sample notebook here.
  2. Upload the utils file to S3 and read it from S3 within your notebook.
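A minimal sketch of option 1, assuming the SageMaker Python SDK's NotebookJobStep run as a one-step SageMaker Pipeline; the role ARN, image URI, and paths below are placeholders, and exact parameter names may vary with your SDK version:

from sagemaker.workflow.notebook_job_step import NotebookJobStep
from sagemaker.workflow.pipeline import Pipeline

ROLE_ARN = "arn:aws:iam::123456789012:role/MySageMakerExecutionRole"  # placeholder

# The utils folder is listed as an additional dependency, so it gets uploaded to S3
# and placed next to the notebook on the ephemeral instance at run time.
step = NotebookJobStep(
    name="data-preparation",
    input_notebook="conversion_scores/data_preparation.ipynb",
    additional_dependencies=["conversion_scores/utils"],
    image_uri="<sagemaker-distribution-image-uri>",  # placeholder
    kernel_name="python3",
    role=ROLE_ARN,
    instance_type="ml.m5.large",
)

pipeline = Pipeline(name="conversion-scores-notebook-job", steps=[step])
pipeline.upsert(role_arn=ROLE_ARN)
pipeline.start()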

In either case, though, you won't be able to write the output CSV file to local storage (EFS). It will be uploaded to S3, and you can download the file from S3 afterwards.
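For example, instead of df.to_csv(pathname, ...), the notebook could write the file to S3 directly. A minimal sketch with boto3, reusing df and end_date from your notebook; the bucket name and key prefix are placeholders:

import boto3
from io import StringIO

bucket = "my-conversion-scores-bucket"  # placeholder: replace with your bucket
key = f"conversion_scores/to_predict/{end_date.strftime('%Y-%m-%d')}_prediction_run_data.csv"

# Serialize the DataFrame in memory and upload it to S3 instead of local storage.
buffer = StringIO()
df.to_csv(buffer, index=False)
boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue().encode("utf-8"))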

Upvotes: 0
