Reputation: 177
I will be running ML models on a pretty large dataset. It is about 15 GB, with 200 columns and 4.3 million rows. I'm wondering what the best Notebook instance type is for this kind of dataset in AWS SageMaker.
Upvotes: 3
Views: 4392
Reputation: 1875
For choosing a SageMaker hosted notebook type:
Do you plan to do all of your preprocessing of your data in-memory on the notebook, or do you plan to orchestrate ETL with external services?
If you're planning to load the dataset into memory on the notebook instance for exploration/preprocessing, the primary bottleneck is memory: the instance needs enough RAM to hold your dataset plus working space for copies made during preprocessing. That calls for at least the 16 GB types (.xlarge; see the full list of ML instance types in the SageMaker documentation). Further, depending on how compute-intensive your preprocessing is and your desired completion time, you can opt for a compute-optimized instance (c4, c5) to speed it up.
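A quick back-of-envelope check makes the sizing concrete. The row/column counts below come from the question; the 8 bytes per value assumes dense float64 columns (pandas' default for numeric data), and the 2.5x working-space multiplier is a rule of thumb for preprocessing copies, not an AWS figure:

```python
def estimate_memory_gb(rows, cols, bytes_per_value=8, overhead_factor=2.5):
    """Rough RAM estimate for loading and manipulating a dense numeric dataset.

    bytes_per_value=8 assumes float64 columns; overhead_factor leaves
    headroom for the intermediate copies pandas makes during preprocessing.
    """
    raw_gb = rows * cols * bytes_per_value / 1024**3
    return raw_gb * overhead_factor

# 4.3M rows x 200 columns, per the question
raw = estimate_memory_gb(4_300_000, 200, overhead_factor=1.0)
with_headroom = estimate_memory_gb(4_300_000, 200)

print(f"raw in-memory size:  ~{raw:.1f} GB")
print(f"with working space:  ~{with_headroom:.1f} GB")
```

The raw in-memory footprint lands around 6-7 GB, but with working space it brushes up against 16 GB, so an ml.m5.2xlarge (32 GB) would be a safer choice than an .xlarge if you do heavy in-memory preprocessing.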
For the training job, specifically:
Using the Amazon SageMaker SDK, your training data will be loaded and distributed to the training cluster, allowing your training job to be completely separate from the instance your hosted notebook is running on.
Figuring out the ideal instance type for training will depend on whether your algorithm of choice/training job is memory-, CPU-, or IO-bound. Since your dataset will be loaded onto your training cluster from S3, the instance you choose for your hosted notebook has no bearing on the speed of your training job.
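To illustrate the decoupling, here is a sketch of the low-level CreateTrainingJob request that the SageMaker SDK ultimately builds for you (boto3's `create_training_job` accepts a payload of this shape). Every name, ARN, S3 URI, and the image URI below is a placeholder, not a real resource; the point is that the training cluster is sized in `ResourceConfig`, independently of the notebook:

```python
# Sketch of a CreateTrainingJob request payload. All identifiers below
# (job name, ECR image, role ARN, bucket paths) are placeholders.
training_job_request = {
    "TrainingJobName": "my-training-job",  # placeholder
    "AlgorithmSpecification": {
        # placeholder ECR image URI for your algorithm container
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algo:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://my-bucket/train/",  # placeholder
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
        }
    ],
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},  # placeholder
    # The training cluster is sized here -- not by the notebook instance:
    "ResourceConfig": {
        "InstanceType": "ml.m5.2xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,  # enough local disk for the ~15 GB dataset
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}

# In a real notebook you would submit it with:
#   import boto3
#   boto3.client("sagemaker").create_training_job(**training_job_request)
```

In practice you would more often use the SageMaker Python SDK's Estimator, which constructs this request for you, but the shape above makes explicit that data flows from S3 to the training cluster without touching the notebook.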
Broadly: when it comes to SageMaker notebooks, the best practice is to use your notebook as a "puppeteer" or orchestrator that calls out to external services (AWS Glue or Amazon EMR for preprocessing, SageMaker for training, S3 for storage, etc.). It is best to treat notebooks as ephemeral compute for building and kicking off your experiment pipeline.
This allows you to match compute, storage, and hosting resources/services to the demands of your workload, ultimately getting the best bang for your buck by not paying for idle or unused resources.
Upvotes: 7