doesnt_matter

Reputation: 121

AWS SageMaker Very Large Dataset

I have a 500 GB CSV file and a MySQL database with 1.5 TB of data, and I want to run the AWS SageMaker classification and regression algorithms, as well as random forest, on them.

Can AWS SageMaker support this? Can the model be read and trained in batches or chunks? Is there any example of this?

Upvotes: 6

Views: 11139

Answers (3)

herve

Reputation: 3964

Amazon SageMaker is built to help you scale your training activities. With large datasets, there are two main aspects to consider:

  • The way data are stored and accessed,
  • The actual training parallelism.

Data storage: S3 is the most cost-effective way to store your data for training. To get faster startup and training times, you can consider the following:

  • If your data is already stored on Amazon S3, you might first consider leveraging Pipe mode, either with built-in algorithms or by bringing your own (a configuration sketch follows this list). Pipe mode is not suitable all the time, though: for example, if your algorithm needs to backtrack or skip ahead within an epoch (the underlying FIFO cannot support lseek() operations), or if it is not easy to parse your training dataset from a streaming source.
  • In those cases, you may want to leverage Amazon FSx for Lustre or Amazon EFS file systems. If your training data is already in Amazon EFS, I recommend using it as a data source; otherwise, choose Amazon FSx for Lustre.
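Here is a minimal sketch of enabling Pipe mode with the SageMaker Python SDK (v2), assuming the built-in XGBoost algorithm and CSV data; the bucket name, prefix, instance type, algorithm version and hyperparameters are placeholders, not anything from the question:

    import sagemaker
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()

    # Built-in XGBoost container for the current region (version is an assumption)
    image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1")

    estimator = Estimator(
        image_uri=image_uri,
        role=role,
        instance_count=1,
        instance_type="ml.m5.4xlarge",
        input_mode="Pipe",                # stream data from S3 instead of downloading it first
        sagemaker_session=session,
    )
    estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

    train_input = TrainingInput(
        "s3://my-bucket/train/",          # placeholder S3 prefix
        content_type="text/csv",
        input_mode="Pipe",
    )
    estimator.fit({"train": train_input})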

Training Parallelism: With large datasets, it is likely you'll want to train on multiple GPUs. In that case, consider the following:

  • If your training code is already Horovod-ready, you can run it with Amazon SageMaker (notebook).
  • In December, AWS released managed data parallelism, which simplifies parallel training over multiple GPUs. As of today, it is available for TensorFlow and PyTorch (see the sketch below).
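A sketch of launching a training job with managed data parallelism via the SageMaker Python SDK, assuming a PyTorch training script; the script name, framework version, instance count and instance type below are placeholders:

    import sagemaker
    from sagemaker.pytorch import PyTorch

    role = sagemaker.get_execution_role()

    estimator = PyTorch(
        entry_point="train.py",            # your own training script (placeholder name)
        role=role,
        framework_version="1.8.1",         # assumption: a version supported by the data parallel library
        py_version="py36",
        instance_count=2,
        instance_type="ml.p3.16xlarge",    # data parallelism requires multi-GPU instance types
        # Enable SageMaker's managed data parallelism
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )

    estimator.fit({"train": "s3://my-bucket/train/"})   # placeholder S3 prefix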

(bonus) Cost Optimisation: Do not forget to leverage Managed Spot Training to save up to 90% on compute costs.
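A sketch of turning on Managed Spot Training; these arguments work on any SageMaker estimator, and the image, role, bucket and time limits are placeholders. The checkpoint location lets an interrupted job resume:

    import sagemaker
    from sagemaker.estimator import Estimator

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()
    image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1")

    spot_estimator = Estimator(
        image_uri=image_uri,
        role=role,
        instance_count=1,
        instance_type="ml.m5.4xlarge",
        use_spot_instances=True,                          # run on Spot capacity
        max_run=3600,                                     # max training time, in seconds
        max_wait=7200,                                    # total time incl. waiting for Spot (>= max_run)
        checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # placeholder; enables resuming after interruption
    )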

You will find other examples on the Amazon SageMaker Distributed Training documentation page.

Upvotes: 5

pm3310

Reputation: 119

You can use SageMaker for large-scale Machine Learning tasks! It's designed for that. I developed this open source project https://github.com/Kenza-AI/sagify (sagify), a CLI tool that helps you train and deploy your Machine Learning/Deep Learning models on SageMaker in a very easy way. I managed to train and deploy all of my ML models with it, whatever library I was using (Keras, TensorFlow, scikit-learn, LightFM, etc.).

Upvotes: 1

Guy

Reputation: 12901

Amazon SageMaker is designed for such scales and it is possible to use it to train on very large datasets. To take advantage of the scalability of the service you should consider a few modifications to your current practices, mainly around distributed training.

Distributed training makes training much faster (“100 hours of a single instance cost exactly the same as 1 hour of 100 instances, just 100 times faster”), more scalable (“if you have 10 times more data, you just add 10 times more instances and everything just works”) and more reliable, as each instance handles only a small part of the dataset or the model and doesn’t run out of disk or memory space.

It is not obvious how to implement an ML algorithm in a distributed way that is still efficient and accurate. Amazon SageMaker has modern implementations of classic ML algorithms such as Linear Learner, K-Means, PCA and XGBoost that support distributed training and can scale to such dataset sizes. In some benchmarks these implementations were 10 times faster than other distributed training implementations such as Spark MLlib. You can see an example in this notebook: https://github.com/awslabs/amazon-sagemaker-workshop/blob/master/notebooks/video-game-sales-xgboost.ipynb
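As a sketch, distributing one of the built-in algorithms is mostly a matter of raising instance_count on the estimator; the example below assumes built-in XGBoost, and the bucket, instance type and hyperparameters are placeholders:

    import sagemaker
    from sagemaker.estimator import Estimator
    from sagemaker.inputs import TrainingInput

    session = sagemaker.Session()
    role = sagemaker.get_execution_role()
    image_uri = sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1")

    estimator = Estimator(
        image_uri=image_uri,
        role=role,
        instance_count=10,                 # ten instances share the training work
        instance_type="ml.m5.4xlarge",
        sagemaker_session=session,
    )
    estimator.set_hyperparameters(objective="reg:squarederror", num_round=200)

    estimator.fit({"train": TrainingInput("s3://my-bucket/train/", content_type="text/csv")})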

The other aspect of the scale is the data file(s). The data shouldn’t sit in a single file, as that limits your ability to distribute it across the cluster used for distributed training. With SageMaker you can decide how to use the data files from Amazon S3: fully replicated mode copies all the data to every worker, while sharding by key distributes the data across the workers and can speed up the training even further. You can see some examples in this notebook: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/data_distribution_types
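A sketch of the two S3 data distribution types on a training channel; with ShardedByS3Key each instance receives only a subset of the S3 objects, so the dataset should be split into many files. The S3 prefix is a placeholder:

    from sagemaker.inputs import TrainingInput

    # Every instance receives a full copy of the dataset
    replicated_input = TrainingInput(
        "s3://my-bucket/train/",            # placeholder S3 prefix
        content_type="text/csv",
        distribution="FullyReplicated",
    )

    # The S3 objects are split across the training instances
    sharded_input = TrainingInput(
        "s3://my-bucket/train/",
        content_type="text/csv",
        distribution="ShardedByS3Key",
    )

    # Pass whichever channel definition fits the algorithm, e.g.:
    # estimator.fit({"train": sharded_input})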

Upvotes: 5
