Reputation: 21
I am trying to train a machine learning model on AWS EC2. I have over 50GB of data currently stored in an AWS S3 bucket. When training my model on EC2, I want to be able to access this data. Essentially, I want to be able to call this command:
python3 train_model.py --train_files /data/train.csv --dev_files /data/dev.csv --test_files /data/test.csv
where /data/train.csv is in my S3 bucket s3://data/. How can I do this? I currently only see ways to cp my S3 data onto my EC2 instance.
Upvotes: 2
Views: 1254
Reputation: 238299
How can I do this? I currently only see ways to cp my S3 data onto my EC2 instance.
S3 is an object storage system. It does not allow direct access to or reading of objects the way a regular file system does.
Thus, to read your files, you need to download them first (downloading in parts is also possible), or have some third-party software such as s3fs-fuse do it for you by mounting the bucket as a local file system. You can download the files to your instance's own storage, or store them on an external file system (e.g. EFS).
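For example, here is a minimal sketch of downloading the files with boto3 before training; the bucket name data and the key names are taken from your question, so adjust them to your actual bucket layout:

import boto3

# Download each CSV from s3://data/ to the local /data/ directory
# before invoking train_model.py. boto3's download_file performs
# multipart transfers under the hood, so it also handles large objects.
s3 = boto3.client("s3")

for key in ("train.csv", "dev.csv", "test.csv"):
    s3.download_file(Bucket="data", Key=key, Filename=f"/data/{key}")

After this runs, your original command works unchanged, since the files now exist locally at the paths it expects.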
It's not clear from your question whether you have one 50 GB CSV file or multiple smaller ones. If you have one large 50 GB CSV file and you don't need all of it, you can reduce the amount of data read at once using S3 Select:
With S3 Select, you can use a simple SQL expression to return only the data from the store you’re interested in, instead of retrieving the entire object. This means you’re dealing with an order of magnitude less data which improves the performance of your underlying applications.
Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format.
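Here is a minimal sketch of S3 Select via boto3; the column name label in the WHERE clause is hypothetical, so replace it with a real column from your CSV:

import boto3

s3 = boto3.client("s3")

# Ask S3 to filter the object server-side and return only matching rows,
# instead of transferring the whole 50 GB object.
response = s3.select_object_content(
    Bucket="data",
    Key="train.csv",
    ExpressionType="SQL",
    Expression="SELECT * FROM s3object s WHERE s.label = '1'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The result arrives as an event stream; "Records" events carry the data.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")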
Upvotes: 2