Reputation: 21
I am trying to train a machine learning model on AWS EC2. I have over 50GB of data currently stored in an AWS S3 bucket. When training my model on EC2, I want to be able to access this data. Essentially, I want to be able to call this command:
python3 train_model.py --train_files /data/train.csv --dev_files /data/dev.csv --test_files /data/test.csv
where /data/train.csv is in my S3 bucket s3://data/. How can I do this? I currently only see ways to cp my S3 data onto my EC2 instance.
Upvotes: 2
Views: 1254
Reputation: 238299
How can I do this? I currently only see ways to cp my S3 data onto my EC2 instance.
S3 is an object storage system. It does not allow direct access to or reading of objects the way a regular file system does.
Thus, to read your files, you need to download them first (downloading in parts is also possible), or have some third-party software such as s3fs-fuse do it for you by mounting the bucket as a local file system. You can download the files to your instance's own storage, or store them on an external file system (e.g. EFS).
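For example, here is a minimal sketch of downloading the files with boto3 before training; the bucket name data and the key names are taken from your question, so adjust them to your actual bucket layout:

import boto3

# Download each CSV from s3://data/ to the local /data/ directory
# before invoking train_model.py. boto3's download_file performs
# multipart transfers under the hood, so it also handles large objects.
s3 = boto3.client("s3")

for key in ("train.csv", "dev.csv", "test.csv"):
    s3.download_file(Bucket="data", Key=key, Filename=f"/data/{key}")

After this runs, your original command works unchanged, since the files now exist locally at the paths it expects.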
It's not clear from your question whether you have one 50 GB CSV file or multiple smaller ones. If you have one large 50 GB CSV file and you don't need all of it, you can reduce the amount of data read at once using S3 Select:
With S3 Select, you can use a simple SQL expression to return only the data from the store you’re interested in, instead of retrieving the entire object. This means you’re dealing with an order of magnitude less data which improves the performance of your underlying applications.
Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format.
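Here is a minimal sketch of S3 Select via boto3; the column name label in the WHERE clause is hypothetical, so replace it with a real column from your CSV:

import boto3

s3 = boto3.client("s3")

# Ask S3 to filter the object server-side and return only matching rows,
# instead of transferring the whole 50 GB object.
response = s3.select_object_content(
    Bucket="data",
    Key="train.csv",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.label = '1'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream; "Records" events carry the data.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")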
Upvotes: 2