Reputation: 121
I have a CSV file of 500 GB and a MySQL database with 1.5 TB of data, and I want to run the AWS SageMaker classification and regression algorithms, as well as random forest, on them.
Can AWS SageMaker support this? Can the model be read and trained in batches or chunks? Is there any example for it?
Upvotes: 6
Views: 11139
Reputation: 3964
Amazon SageMaker is built to help you scale your training activities. With large datasets, you might consider two main aspects:
Data storage: S3 is the most cost-effective way to store your data for training. To get faster startup and training times, you can consider Pipe mode, either with built-in algorithms or when bringing your own. Pipe mode is not suitable all the time, though: for example, if your algorithm needs to backtrack or skip ahead within an epoch (the underlying FIFO cannot support lseek() operations), or if it is not easy to parse your training dataset from a streaming source.
Training parallelism: With large datasets, it is likely you'll want to train on different GPUs. In that case, consider SageMaker's distributed training options (a configuration sketch is included at the end of this answer).
(Bonus) Cost optimisation: Do not forget to leverage Managed Spot training to save up to 90% of the compute costs.
You will find other examples on the Amazon SageMaker Distributed Training documentation page.
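To make the points above concrete, here is a minimal, hypothetical sketch of such a training job using the SageMaker Python SDK (v2-style API); the image URI, role, instance settings and S3 paths are placeholders, not values taken from the question:

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<training-image-uri>",       # placeholder: a built-in algorithm image or your own container
    role="<sagemaker-execution-role-arn>",  # placeholder IAM execution role
    instance_count=2,                       # more than one instance for parallel training
    instance_type="ml.m5.4xlarge",
    input_mode="Pipe",                      # stream training data from S3 instead of downloading it first
    use_spot_instances=True,                # Managed Spot training
    max_run=3600,                           # maximum training time, in seconds
    max_wait=7200,                          # must be >= max_run when Spot instances are used
    output_path="s3://my-bucket/output/",   # placeholder bucket for model artifacts
)

estimator.fit({"train": "s3://my-bucket/train/"})  # placeholder S3 prefix with the training data
```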
Upvotes: 5
Reputation: 119
You can use SageMaker for large-scale Machine Learning tasks! It's designed for that. I developed this open-source project https://github.com/Kenza-AI/sagify (sagify); it's a CLI tool that can help you train and deploy your Machine Learning/Deep Learning models on SageMaker in a very easy way. I managed to train and deploy all of my ML models whatever library I was using (Keras, TensorFlow, scikit-learn, LightFM, etc.).
Upvotes: 1
Reputation: 12901
Amazon SageMaker is designed for such scales and it is possible to use it to train on very large datasets. To take advantage of the scalability of the service you should consider a few modifications to your current practices, mainly around distributed training.
You will probably want to use distributed training, which makes training much faster (“100 hours of a single instance cost exactly the same as 1 hour of 100 instances, just 100 times faster”), more scalable (“if you have 10 times more data, you just add 10 times more instances and everything just works”) and more reliable, as each instance only handles a small part of the dataset or the model and doesn’t run out of disk or memory space.
It is not obvious how to implement an ML algorithm in a distributed way that remains efficient and accurate. Amazon SageMaker has modern implementations of classic ML algorithms such as Linear Learner, K-Means, PCA and XGBoost that support distributed training and can scale to such dataset sizes. In some benchmarks these implementations were up to 10 times faster than other distributed training implementations such as Spark MLlib. You can see some examples in this notebook: https://github.com/awslabs/amazon-sagemaker-workshop/blob/master/notebooks/video-game-sales-xgboost.ipynb
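As a rough illustration only (not part of the linked notebook), launching the built-in XGBoost algorithm on several instances might look like this with the SageMaker Python SDK v2; the role ARN, bucket names and hyperparameters are placeholders:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name

# Resolve the URI of the built-in XGBoost container for this region.
xgb_image = sagemaker.image_uris.retrieve(framework="xgboost", region=region, version="1.5-1")

estimator = Estimator(
    image_uri=xgb_image,
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder execution role
    instance_count=4,                                       # distributed training across 4 instances
    instance_type="ml.m5.4xlarge",
    output_path="s3://my-bucket/xgb-output/",               # placeholder bucket
    sagemaker_session=session,
)

estimator.set_hyperparameters(objective="binary:logistic", num_round=200)  # placeholder hyperparameters

# CSV input for built-in XGBoost: the label is expected in the first column, no header row.
train_input = TrainingInput("s3://my-bucket/train/", content_type="text/csv")
estimator.fit({"train": train_input})
```

With a built-in algorithm that supports distributed training, increasing instance_count is essentially all that is needed to spread the work across the cluster.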
The other aspect of the scale is the data file(s). The data shouldn’t sit in a single file, as that limits the ability to distribute it across the cluster that you are using for your distributed training. With SageMaker you can decide how to use the data files from Amazon S3. They can be used in a fully replicated mode, where all the data is copied to all the workers, but they can also be sharded by key, which distributes the data across the workers and can speed up the training even further. You can see some examples in this notebook: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/data_distribution_types
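As a hedged sketch (the S3 prefixes and channel name are placeholders), the two distribution types mentioned above can be selected per input channel with the SageMaker Python SDK:

```python
from sagemaker.inputs import TrainingInput

# Every training instance receives a full copy of the data (the SageMaker default).
replicated_input = TrainingInput(
    "s3://my-bucket/train/",          # placeholder S3 prefix
    content_type="text/csv",
    distribution="FullyReplicated",
)

# Each training instance receives only a subset of the S3 objects, so a 500 GB
# dataset is spread across the cluster instead of being copied to every worker.
sharded_input = TrainingInput(
    "s3://my-bucket/train/",          # placeholder S3 prefix
    content_type="text/csv",
    distribution="ShardedByS3Key",
)

# Pass the channel to an Estimator's fit() call, e.g.:
# estimator.fit({"train": sharded_input})
```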
Upvotes: 5