Reputation: 77
I'm new to AWS and I am considering using Amazon SageMaker to train my deep learning model, because I'm having memory issues due to the large dataset and neural network I have to train. I'm confused about whether to store my data on my notebook instance or in S3. If I store it in S3, would I be able to access it to train on my notebook instance? I'm confused about the concepts. Can anyone explain the use of S3 in machine learning on AWS?
Upvotes: 1
Views: 776
Reputation: 848
Yes, you can use S3 as the storage for your training datasets.
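For example, from a notebook instance you can pull a dataset down from S3 with boto3 (the bucket and key names below are placeholders for illustration; this requires AWS credentials with access to the bucket, which a SageMaker notebook instance normally has via its IAM role):

```python
import boto3

# Download a training file from S3 to local notebook storage.
# Replace the bucket and key with your own.
s3 = boto3.client("s3")
s3.download_file(
    Bucket="my-training-bucket",      # hypothetical bucket name
    Key="data/train.csv",             # hypothetical object key
    Filename="/tmp/train.csv",        # local path on the notebook instance
)
```

After this, the file can be read locally as usual (e.g. with pandas or NumPy). For large datasets, though, you typically point a SageMaker training job at the S3 location directly rather than downloading everything into the notebook.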
Refer to the diagram in this link, which describes how everything works together: https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-training.html
You may also want to check out the following blog, which details File mode and Pipe mode, the two mechanisms for transferring training data:
In File mode, the training data is downloaded first to an encrypted EBS volume attached to the training instance before training begins. In Pipe mode, by contrast, the input data is streamed directly to the training algorithm while it is running.
With Pipe input mode, your data is fed on-the-fly into the algorithm container without involving any disk I/O. This approach shortens the lengthy download process and dramatically reduces startup time. It also offers generally better read throughput than File input mode. This is because your data is fetched from Amazon S3 by a highly optimized multi-threaded background process. It also allows you to train on datasets that are much larger than the 16 TB Amazon Elastic Block Store (EBS) volume size limit.
The blog also contains python code snippets using Pipe input mode for reference.
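As a rough sketch of how this looks with the SageMaker Python SDK: you create an estimator with `input_mode="Pipe"` and pass it the S3 location of your data. The image URI, bucket name, role, and instance type below are placeholders; this only runs inside an AWS environment with appropriate permissions:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role of the notebook instance

estimator = Estimator(
    image_uri="<your-training-image-uri>",  # placeholder: your algorithm container
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",          # placeholder instance type
    input_mode="Pipe",                      # stream data from S3 instead of downloading first
    sagemaker_session=session,
)

# Point the training channel at your dataset in S3 (placeholder bucket/prefix).
train_input = TrainingInput(s3_data="s3://my-training-bucket/train/")
estimator.fit({"train": train_input})
```

With `input_mode="File"` instead, SageMaker would first copy everything under that S3 prefix onto the training instance's EBS volume before starting the container.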
Upvotes: 3