Differences between using Sagemaker notebook vs Glue (Sagemaker) notebook

Question

I have a machine learning job I want to run with SageMaker. For data preparation and transformation, I am using some numpy and pandas steps in a notebook.

I noticed that AWS Glue offers both SageMaker and Zeppelin notebooks, which can be created via a development endpoint.

I couldn't find much information online about the differences and benefits of using one over the other (i.e. a SageMaker notebook that imports from S3 vs. a notebook created from Glue).

From what I have researched and tried, it seems I can achieve the same thing with both:

SageMaker notebook: import directly from S3, then use further Python code to process the data.
Glue (need to crawl and create a dataset) as shown here: create a dev endpoint and use a similar script to process the data.

Is anyone able to shed light on this?

Abdelrahman Maharek · Accepted Answer

The question isn't entirely clear, so let me cover the possible interpretations.

When you launch a Glue Development endpoint, you can attach either a SageMaker notebook or a Zeppelin notebook. Both are created and configured by Glue, and your script is executed on the Glue Dev endpoint.

If your question is "what is the difference between a SageMaker notebook created from the Glue console and a SageMaker notebook created from the SageMaker console?"

When you create a notebook instance from the Glue console, the created notebook will always have public internet access enabled. This blog explains the difference between the networking configurations with SM notebooks. You also cannot create the notebook with a specific disk size, but you can stop the notebook once it's created and increase the disk size.

If your question is "what is the difference between SageMaker notebook and Zeppelin notebooks?"

The answer is that the first one uses Jupyter (very popular) while the second one uses Zeppelin.

If your question is "what is the difference between using only a SageMaker notebook versus using an SM notebook + Glue Dev endpoint?"

The answer is: if you are running plain pandas + numpy without Spark, an SM notebook on its own is much cheaper (provided you use a small instance type and your data is relatively small). However, if you need to process a large dataset and plan to use Spark, then SM notebook + Glue Dev endpoint is the best option for developing the job, which will later be executed as a serverless Glue Job (transformation job).
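To make the pandas + numpy path concrete, here is a minimal sketch of the kind of preparation step that fits comfortably in a standalone SM notebook. The function name prepare_features, the column names, and the S3 path in the comment are all hypothetical and not from the question; an in-memory CSV stands in for the S3 object so the snippet runs anywhere.

```python
import io

import numpy as np
import pandas as pd


def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleanup step: fill gaps, then add a log-scaled column."""
    out = df.copy()
    # Replace missing amounts with the column median.
    out["amount"] = out["amount"].fillna(out["amount"].median())
    # log1p keeps zero values finite while compressing large ones.
    out["log_amount"] = np.log1p(out["amount"])
    return out


# In a SageMaker notebook the frame would typically be read from S3, e.g.
#   df = pd.read_csv("s3://my-bucket/raw/data.csv")  # hypothetical path; needs s3fs
# An in-memory CSV is used here instead (assumption for illustration only).
sample_csv = io.StringIO("id,amount\n1,10.0\n2,\n3,30.0\n")
df = pd.read_csv(sample_csv)
features = prepare_features(df)
print(features)
```

Everything here runs in the notebook instance's own memory, which is why this approach only suits data that fits on a single machine; past that point the Spark route on a Glue Dev endpoint becomes the more natural fit.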

An SM notebook on its own is like running Python code on an EC2 instance, whereas SM notebook + Glue is used to develop ETL jobs which you can launch to process deltas.
