Reputation: 135
I'm currently running a quick Machine Learning proof of concept on AWS with SageMaker, and I've come across two libraries: sagemaker and sagemaker_pyspark. I would like to work with distributed data. My questions are:
1. Is using sagemaker the equivalent of running a training job without taking advantage of the distributed computing capabilities of AWS? I assume it is; if not, why have they implemented sagemaker_pyspark? Based on this assumption, I do not understand what it would offer over using scikit-learn on a SageMaker notebook (in terms of computing capabilities).
2. Is it normal for something like model = xgboost_estimator.fit(training_data) to take 4 minutes to run with sagemaker_pyspark for a small set of test data? I can see below that it trains the model and also creates an endpoint to offer its predictive services, and I assume that this endpoint is deployed on an EC2 instance that is created and started at that moment. Correct me if I'm wrong. I assume this from how the estimator is defined:
from sagemaker import get_execution_role
from sagemaker_pyspark import IAMRole  # needed for the sagemakerRole argument
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

xgboost_estimator = XGBoostSageMakerEstimator(
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    sagemakerRole=IAMRole(get_execution_role()),
)
xgboost_estimator.setNumRound(1)
If so, is there a way to reuse the same endpoint with different training jobs so that I don't have to wait for a new endpoint to be created each time?
3. Does sagemaker_pyspark support custom algorithms, or does it only allow you to use the ones predefined in the library?
4. Do you know if sagemaker_pyspark can perform hyperparameter optimization? From what I see, sagemaker offers the HyperparameterTuner class, but I can't find anything like it in sagemaker_pyspark. I suppose it is a more recent library and there is still a lot of functionality to implement.
5. I am a bit confused about the concepts of entry_point and container/image_name (both possible input arguments for the Estimator object from the sagemaker library): can you deploy models with and without containers? Why would you use model containers? Do you always need to define the model externally with the entry_point script? It is also confusing that the class AlgorithmEstimator allows the input argument algorithm_arn; I see there are three different ways of passing a model as input. Why? Which one is better?
6. I see the sagemaker library offers SageMaker Pipelines, which seem to be very handy for deploying properly structured ML workflows. However, I don't think this is available with sagemaker_pyspark, so in that case I would rather create my workflows with a combination of Step Functions (to orchestrate the entire thing), Glue processes (for ETL, preprocessing, and feature/target engineering), and SageMaker processes using sagemaker_pyspark.
7. I also found out that sagemaker has the sagemaker.sparkml.model.SparkMLModel object. What is the difference between this and what sagemaker_pyspark offers?
Upvotes: 1
Views: 975
Reputation: 48246
sagemaker is the SageMaker Python SDK. It calls SageMaker-related AWS service APIs on your behalf. You don't need to use it, but it can make life easier.
- Is using sagemaker the equivalent of running a training job without taking advantage of the distributed computing capabilities of AWS? I assume it is, if not, why have they implemented sagemaker_pyspark?
No. You can run distributed training jobs using sagemaker (see the instance_count parameter).
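For instance, a minimal sketch of a distributed training job with the built-in XGBoost image; the region, bucket paths, and hyperparameter values here are illustrative assumptions:

from sagemaker import get_execution_role, image_uris
from sagemaker.estimator import Estimator

# Look up the built-in XGBoost container for a region (assumed us-east-1)
container = image_uris.retrieve("xgboost", region="us-east-1", version="1.5-1")

estimator = Estimator(
    image_uri=container,
    role=get_execution_role(),
    instance_count=2,                     # >1 -> the training job is distributed
    instance_type="ml.m4.xlarge",
    output_path="s3://my-bucket/output",  # hypothetical bucket
)
estimator.set_hyperparameters(num_round=10)
estimator.fit({"train": "s3://my-bucket/train"})  # hypothetical data channel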
sagemaker_pyspark facilitates calling SageMaker-related AWS service APIs from Spark. Use it if you want to use SageMaker services from Spark.
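For context, a hedged sketch of that round trip; training_df and test_df are assumed pre-existing Spark DataFrames with the label/features columns the algorithm expects:

from pyspark.sql import SparkSession
import sagemaker_pyspark
from sagemaker import get_execution_role
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

# Put the SageMaker Spark JARs on the driver classpath
jars = ":".join(sagemaker_pyspark.classpath_jars())
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", jars)
         .getOrCreate())

estimator = XGBoostSageMakerEstimator(
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=2,  # the SageMaker training job itself is distributed
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    sagemakerRole=IAMRole(get_execution_role()),
)
estimator.setNumRound(10)

# training_df / test_df: assumed Spark DataFrames
model = estimator.fit(training_df)      # Spark DataFrame in -> SageMaker training job
predictions = model.transform(test_df)  # rows are sent to the SageMaker endpoint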
- Is it normal for something like model = xgboost_estimator.fit(training_data) to take 4 minutes to run with sagemaker_pyspark for a small set of test data?
Yes, it takes a few minutes for an EC2 instance to spin up. Use Local Mode if you want to iterate more quickly locally. Note: Local Mode won't work with SageMaker built-in algorithms, but you can prototype with (non-AWS) XGBoost/scikit-learn.
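As an example, a hedged sketch of Local Mode via the SDK's scikit-learn framework estimator; the script name, role ARN, and data path are hypothetical:

from sagemaker.sklearn.estimator import SKLearn

sklearn_local = SKLearn(
    entry_point="train.py",   # hypothetical training script
    framework_version="0.23-1",
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # hypothetical role ARN
    instance_type="local",    # Local Mode: trains in a container on this machine
    instance_count=1,
)
sklearn_local.fit({"train": "file://./data/train"})  # local files, no instance spin-up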
- Does sagemaker_pyspark support custom algorithms? Or does it only allow you to use the predefined ones in the library?
Yes, but you'd probably want to extend SageMakerEstimator. Here you can provide the trainingImage URI.
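A rough sketch of pointing SageMakerEstimator at your own images. The ECR URIs are hypothetical, the serializer/deserializer classes shown are ones sagemaker_pyspark ships for its built-ins (a custom algorithm may need its own), and the exact constructor signature should be checked against your sagemaker_pyspark version:

from sagemaker import get_execution_role
from sagemaker_pyspark import IAMRole, SageMakerEstimator
from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
from sagemaker_pyspark.transformation.deserializers import XGBoostCSVRowDeserializer

custom_estimator = SageMakerEstimator(
    trainingImage="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",  # hypothetical
    modelImage="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",     # hypothetical
    requestRowSerializer=ProtobufRequestRowSerializer(),
    responseRowDeserializer=XGBoostCSVRowDeserializer(),
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m4.xlarge",
    endpointInitialInstanceCount=1,
    sagemakerRole=IAMRole(get_execution_role()),
)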
- Do you know if sagemaker_pyspark can perform hyperparameter optimization?
It does not appear so. It'd probably be easier to just do this from SageMaker itself, though.
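For reference, a minimal sketch with the sagemaker SDK's HyperparameterTuner; the wrapped estimator, metric name, ranges, and S3 paths are illustrative assumptions:

from sagemaker.tuner import (ContinuousParameter, IntegerParameter,
                             HyperparameterTuner)

tuner = HyperparameterTuner(
    estimator=xgb_estimator,                  # an existing sagemaker Estimator (assumed)
    objective_metric_name="validation:rmse",  # emitted by the built-in XGBoost algorithm
    hyperparameter_ranges={
        "eta": ContinuousParameter(0.01, 0.3),
        "max_depth": IntegerParameter(3, 10),
    },
    max_jobs=10,
    max_parallel_jobs=2,
)
tuner.fit({"train": "s3://my-bucket/train",        # hypothetical paths
           "validation": "s3://my-bucket/val"})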
- Can you deploy models with and without containers?
You can certainly host your own models any way you want. But if you want to use SageMaker model inference hosting, then containers are required
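To illustrate, a hedged sketch of SageMaker hosting, where a container image does the serving; the image URI and model artifact path are hypothetical:

from sagemaker import get_execution_role
from sagemaker.model import Model

model = Model(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-inference:latest",  # hypothetical
    model_data="s3://my-bucket/model.tar.gz",  # hypothetical trained artifact
    role=get_execution_role(),
)
# deploy() creates the endpoint (this is the slow instance spin-up step)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")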
- Why would you use model containers?
- Do you always need to define the model externally with the entry_point script?
The whole Docker thing makes bundling dependencies easier and also makes things language/runtime-neutral. SageMaker doesn't care if your algorithm is in Python or Java or Fortran, but it needs to know how to "run" it, so you tell it a working directory and a command to run. This is the entry point.
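To make that concrete, a hedged sketch of the two common patterns; the script name and image URI are hypothetical:

from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.sklearn.estimator import SKLearn

role = get_execution_role()

# Pattern 1: prebuilt framework container + your entry_point script.
# SageMaker runs train.py (hypothetical) inside its managed scikit-learn image.
script_estimator = SKLearn(
    entry_point="train.py",
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m4.xlarge",
    instance_count=1,
)

# Pattern 2: bring-your-own-container -- no entry_point, because the image
# itself (hypothetical URI) defines how training runs.
byoc_estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
    role=role,
    instance_type="ml.m4.xlarge",
    instance_count=1,
)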
- It is also confusing that the class AlgorithmEstimator allows the input argument algorithm_arn; I see there are three different ways of passing a model as input. Why? Which one is better?
Please clarify which "three" you are referring to.
6 is not a question, so no answer required :)
- What is the difference between this and what sagemaker_pyspark offers?
sagemaker_pyspark lets you call SageMaker services from Spark, whereas SparkML Serving lets you use Spark ML services from SageMaker.
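For illustration, a hedged sketch of SparkMLModel, which hosts an MLeap-serialized Spark ML pipeline behind a SageMaker endpoint; the S3 path is hypothetical:

from sagemaker import get_execution_role
from sagemaker.sparkml.model import SparkMLModel

# model_data points at a Spark ML pipeline serialized with MLeap (hypothetical path)
sparkml_model = SparkMLModel(
    model_data="s3://my-bucket/mleap-model.tar.gz",
    role=get_execution_role(),
)
predictor = sparkml_model.deploy(initial_instance_count=1, instance_type="ml.m4.xlarge")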
Upvotes: 1