cpumar

Reputation: 135

What are the differences between AWS sagemaker and sagemaker_pyspark?

I'm currently running a quick Machine Learning proof of concept on AWS with SageMaker, and I've come across two libraries: sagemaker and sagemaker_pyspark. I would like to work with distributed data. My questions are:

  1. Is using sagemaker the equivalent of running a training job without taking advantage of the distributed computing capabilities of AWS? I assume it is; if not, why have they implemented sagemaker_pyspark? Based on this assumption, I do not see what advantage it would offer over using scikit-learn in a SageMaker notebook (in terms of computing capabilities).

  2. Is it normal for something like model = xgboost_estimator.fit(training_data) to take 4 minutes to run with sagemaker_pyspark for a small set of test data? From what I can see, the call below both trains the model and creates an endpoint to serve predictions, and I assume that endpoint is deployed on an EC2 instance that is created and started on the spot. Correct me if I'm wrong. I infer this from how the estimator is defined:

from sagemaker import get_execution_role
from sagemaker_pyspark import IAMRole  # needed for the sagemakerRole argument
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator


xgboost_estimator = XGBoostSageMakerEstimator(
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    sagemakerRole=IAMRole(get_execution_role())
)

xgboost_estimator.setNumRound(1)

If so, is there a way to reuse the same endpoint with different training jobs so that I don't have to wait for a new endpoint to be created each time?

  3. Does sagemaker_pyspark support custom algorithms? Or does it only allow you to use the predefined ones in the library?

  4. Do you know if sagemaker_pyspark can perform hyperparameter optimization? From what I see, sagemaker offers the HyperparameterTuner class, but I can't find anything like it in sagemaker_pyspark. I suppose it is a more recent library and there is still a lot of functionality to implement.

  5. I am a bit confused about the concepts of entry_point and container/image_name (both possible input arguments for the Estimator object from the sagemaker library): can you deploy models with and without containers? Why would you use model containers? Do you always need to define the model externally with the entry_point script? It is also confusing that the class AlgorithmEstimator allows the input argument algorithm_arn; I see there are three different ways of passing a model as input. Why? Which one is better?

  6. I see the sagemaker library offers SageMaker Pipelines, which seem to be very handy for deploying properly structured ML workflows. However, I don't think this is available with sagemaker_pyspark, so in that case I would rather create my workflows with a combination of Step Functions (to orchestrate the entire thing), Glue processes (for ETL, preprocessing and feature/target engineering) and SageMaker processes using sagemaker_pyspark.

  7. I also found out that sagemaker has the sagemaker.sparkml.model.SparkMLModel object. What is the difference between this and what sagemaker_pyspark offers?

Upvotes: 1

Views: 975

Answers (1)

Neil McGuigan

Reputation: 48246

sagemaker is the SageMaker Python SDK. It calls SageMaker-related AWS service APIs on your behalf. You don't need to use it, but it can make life easier.

  1. Is using sagemaker the equivalent of running a training job without taking advantage of the distributed computing capabilities of AWS? I assume it is; if not, why have they implemented sagemaker_pyspark?

No. You can run distributed training jobs using sagemaker (see the instance_count parameter), as sketched below.
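A minimal sketch of what that looks like, assuming SageMaker Python SDK v2; the S3 paths are placeholders and the XGBoost image version is only an example:

import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Built-in XGBoost container for the current region (version is just an example)
image = image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1")

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=2,               # two training instances -> distributed training job
    instance_type="ml.m4.xlarge",
    output_path="s3://your-bucket/output",   # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(num_round=10)

# "s3://your-bucket/train" is a placeholder for your training data location
estimator.fit({"train": "s3://your-bucket/train"})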

sagemaker_pyspark facilitates calling SageMaker-related AWS service APIs from Spark. Use it if you want to use SageMaker services from Spark, for example as sketched below.
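Continuing from the XGBoostSageMakerEstimator defined in the question, a rough sketch of that workflow (spark is assumed to be an existing SparkSession, and the S3 path is a placeholder):

# XGBoost expects a DataFrame with "label" and "features" columns; libsvm data loads that way
training_df = spark.read.format("libsvm").load("s3a://your-bucket/train/")

# fit() launches a SageMaker training job and, by default, also creates an endpoint
model = xgboost_estimator.fit(training_df)

# transform() sends the DataFrame rows to that endpoint and appends the predictions
predictions_df = model.transform(training_df)
predictions_df.show()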

  2. Is it normal for something like model = xgboost_estimator.fit(training_data) to take 4 minutes to run with sagemaker_pyspark for a small set of test data?

Yes, it takes a few minutes for an EC2 instance to spin up. Use Local Mode if you want to iterate more quickly locally. Note: Local Mode won't work with SageMaker built-in algorithms, but you can prototype with (non-AWS) XGBoost/scikit-learn.
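A minimal Local Mode sketch, assuming SageMaker Python SDK v2 with the local extras installed (pip install "sagemaker[local]"); train.py and the role ARN are placeholders:

from sagemaker.sklearn.estimator import SKLearn

sklearn_estimator = SKLearn(
    entry_point="train.py",                                 # placeholder training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",    # placeholder role ARN
    instance_type="local",            # runs in a Docker container on this machine
    instance_count=1,
    framework_version="0.23-1",
)

# file:// inputs keep the data local; no EC2 training instances are launched
sklearn_estimator.fit({"train": "file://./data/train"})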

  3. Does sagemaker_pyspark support custom algorithms? Or does it only allow you to use the predefined ones in the library?

Yes, but you'd probably want to extend (or directly use) SageMakerEstimator, which lets you provide the trainingImage URI of your own container; see the sketch below.
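A rough sketch of that approach. The ECR image URIs and hyperparameters are placeholders, and the serializer/deserializer classes and exact constructor arguments are assumptions that depend on your container's input/output format and the sagemaker_pyspark version:

from sagemaker import get_execution_role
from sagemaker_pyspark import IAMRole, SageMakerEstimator
from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
from sagemaker_pyspark.transformation.deserializers import KMeansProtobufResponseRowDeserializer

# Assumes the custom container consumes/produces the same recordIO-protobuf format
# as the built-in KMeans algorithm; swap in serializers matching your container.
custom_estimator = SageMakerEstimator(
    trainingImage="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",  # placeholder ECR image
    modelImage="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",     # placeholder ECR image
    requestRowSerializer=ProtobufRequestRowSerializer(),
    responseRowDeserializer=KMeansProtobufResponseRowDeserializer(),
    hyperParameters={"k": "10", "feature_dim": "784"},          # example hyperparameters
    sagemakerRole=IAMRole(get_execution_role()),
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    trainingSparkDataFormat="sagemaker",
)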

  4. Do you know if sagemaker_pyspark can perform hyperparameter optimization?

It does not appear so. It'd probably be easier just to do this from SageMaker itself, though, e.g. via the HyperparameterTuner class you mentioned.
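A minimal sketch with the sagemaker SDK's HyperparameterTuner, assuming an existing estimator such as the XGBoost one sketched earlier; the objective metric and hyperparameter ranges are just examples for built-in XGBoost, and the S3 paths are placeholders:

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

tuner = HyperparameterTuner(
    estimator=estimator,                         # any sagemaker estimator (see sketch above)
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)

# Launches up to 10 training jobs (2 at a time) and keeps the best one
tuner.fit({
    "train": "s3://your-bucket/train",           # placeholder S3 paths
    "validation": "s3://your-bucket/validation",
})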

Can you deploy models with and without containers?

You can certainly host your own models any way you want. But if you want to use SageMaker model inference hosting, then containers are required.

Why would you use model containers?

Do you always need to define the model externally with the entry_point script?

The whole Docker thing makes bundling dependencies easier, and it also makes things language/runtime-neutral. SageMaker doesn't care if your algorithm is in Python or Java or Fortran. But it needs to know how to "run" it, so you tell it a working directory and a command to run. That is the entry point.
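For illustration, a hedged sketch of the two styles in the sagemaker SDK (v2 assumed); the image URI, role ARN and train.py script are placeholders:

from sagemaker.estimator import Estimator
from sagemaker.sklearn.estimator import SKLearn

# Style 1: bring your own container. You supply the image; its ENTRYPOINT/CMD
# already knows how to train, so no entry_point script is needed.
byoc_estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",  # placeholder
    role="arn:aws:iam::123456789012:role/SageMakerRole",                      # placeholder
    instance_count=1,
    instance_type="ml.m4.xlarge",
)

# Style 2: use an AWS-managed framework container and only supply your script.
# SageMaker runs train.py inside its prebuilt scikit-learn image.
framework_estimator = SKLearn(
    entry_point="train.py",
    framework_version="0.23-1",
    role="arn:aws:iam::123456789012:role/SageMakerRole",                      # placeholder
    instance_count=1,
    instance_type="ml.m4.xlarge",
)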

It is also confusing that the class AlgorithmEstimator allows the input argument algorithm_arn; I see there are three different ways of passing a model as input. Why? Which one is better?

Please clarify which "three" you are referring to.

6 is not a question, so no answer required :)

  7. What is the difference between this and what sagemaker_pyspark offers?

sagemaker_pyspark lets you call SageMaker services from Spark, whereas SparkML Serving (which is what sagemaker.sparkml.model.SparkMLModel uses) lets you serve Spark ML models from SageMaker.
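A minimal sketch of the latter, assuming you already have a Spark ML pipeline serialized with MLeap and uploaded to S3 (the S3 path and role ARN are placeholders):

from sagemaker.sparkml.model import SparkMLModel

# model_data points to the MLeap-serialized Spark ML pipeline in S3
sparkml_model = SparkMLModel(
    model_data="s3://your-bucket/sparkml/model.tar.gz",       # placeholder path
    role="arn:aws:iam::123456789012:role/SageMakerRole",      # placeholder role ARN
)

# Hosts the pipeline behind a SageMaker endpoint for real-time inference
predictor = sparkml_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c4.xlarge",
)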

Upvotes: 1
