Reputation: 823
I've just started experimenting with AWS SageMaker and would like to load data from an S3 bucket into a pandas DataFrame in my SageMaker Python Jupyter notebook for analysis.
I could use boto to grab the data from S3, but I'm wondering whether the SageMaker framework offers a more elegant way to do this in my Python code.
Upvotes: 72
Views: 113509
Reputation: 11
There are multiple ways to read data into SageMaker. To make this answer more comprehensive, I am adding details on reading data into a SageMaker Studio notebook in memory, as well as on S3 mounting options.
In my experience, notebooks are not recommended for data-intensive modeling and are mostly used for prototyping, but there are still several ways to read data into them.
Both Boto3 and S3FS can be used together with Python libraries such as pandas to read the data into memory, and both can also be used to copy the data to the local instance's EFS storage.
These two options provide mount-like behaviour, where the data appears as if it were in a local directory, which suits IO-heavy operations. Both of these options have their pros and cons.
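To make the Boto3 route concrete, here is a minimal sketch of both patterns mentioned above: reading a CSV object straight into memory, and copying an object to local storage first. The helper names `read_csv_from_s3` and `copy_s3_object_to_local` are made up for illustration, the bucket and key in the usage comment are placeholders, and the client is passed in as an argument so the helpers can be tried without touching AWS.

```python
import io
from pathlib import Path

import pandas as pd


def read_csv_from_s3(s3_client, bucket, key, **read_csv_kwargs):
    """Download one S3 object into memory and parse it as CSV.

    s3_client is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    response = s3_client.get_object(Bucket=bucket, Key=key)
    body = response["Body"].read()  # the whole object is held in memory
    return pd.read_csv(io.BytesIO(body), **read_csv_kwargs)


def copy_s3_object_to_local(s3_client, bucket, key, local_dir="."):
    """Copy one S3 object into local_dir and return the local path."""
    local_path = Path(local_dir) / Path(key).name
    local_path.parent.mkdir(parents=True, exist_ok=True)
    s3_client.download_file(bucket, key, str(local_path))
    return local_path


# Usage in a notebook (placeholder bucket and key):
#   import boto3
#   s3 = boto3.client("s3")
#   df = read_csv_from_s3(s3, "my-bucket", "train.csv")
#   path = copy_s3_object_to_local(s3, "my-bucket", "train.csv", "data")
```

The in-memory version is convenient for small files; for larger ones, copying to local storage and reading from disk (possibly in chunks) is usually easier on memory.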
Upvotes: 0
Reputation: 2053
You can also use AWS Data Wrangler (https://github.com/awslabs/aws-data-wrangler):
import awswrangler as wr
df = wr.s3.read_csv(path="s3://...")
Upvotes: 4
Reputation: 2053
In the simplest case you don't need boto3, because you just read resources.
Then it's even simpler:
import pandas as pd
bucket='my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
pd.read_csv(data_location)
But as Prateek stated, make sure to configure your SageMaker notebook instance to have access to S3. This is done at the configuration step under Permissions > IAM role.
Upvotes: 54
Reputation: 5153
A similar answer using an f-string:
import pandas as pd
bucket = 'your-bucket-name'
file = 'file.csv'
df = pd.read_csv(f"s3://{bucket}/{file}")
len(df)  # number of rows
Upvotes: 3
Reputation: 2959
This code sample imports a CSV file from S3; it was tested in a SageMaker notebook.
Use pip or conda to install s3fs:
!pip install s3fs
import pandas as pd
my_bucket = '' #declare bucket name
my_file = 'aa/bb.csv' #declare file path
import boto3 # AWS Python SDK
from sagemaker import get_execution_role
role = get_execution_role()
data_location = 's3://{}/{}'.format(my_bucket,my_file)
data=pd.read_csv(data_location)
data.head(2)
Upvotes: 0
Reputation: 4394
You could also access your bucket as a file system using s3fs:
import s3fs
from PIL import Image  # assumes Pillow is installed; display() is provided by Jupyter
fs = s3fs.S3FileSystem()
# List the first 5 entries under a prefix in a bucket you can access
fs.ls('s3://bucket-name/data/')[:5]
# Open an object directly
with fs.open('s3://bucket-name/data/image.png') as f:
    display(Image.open(f))
Upvotes: 13
Reputation: 669
import boto3
import pandas as pd
from sagemaker import get_execution_role
role = get_execution_role()
bucket='my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
pd.read_csv(data_location)
Upvotes: 62
Reputation: 239
Make sure the Amazon SageMaker role has a policy attached that grants it access to S3. This can be done in IAM.
Upvotes: 5