Atwine Mugume

Reputation: 55

How to link an S3 bucket to a SageMaker notebook

I am trying to link my S3 bucket to a notebook instance, but I am not able to.

Here is what I have so far:

from sagemaker import get_execution_role

role = get_execution_role
bucket = 'atwinebankloadrisk'
datalocation = 'atwinebankloadrisk'

data_location = 's3://{}/'.format(bucket)
output_location = 's3://{}/'.format(bucket)

To read the data from the bucket:

df_test = pd.read_csv(data_location/'application_test.csv')
df_train = pd.read_csv('./application_train.csv')
df_bureau = pd.read_csv('./bureau_balance.csv')

However, I keep getting errors and am unable to proceed. I haven't found answers that help much.

PS: I am new to AWS.

Upvotes: 3

Views: 19548

Answers (5)

ishwardgret

Reputation: 1146

import boto3

# In S3, files are stored as objects, and the file name is the object's key.

def write_to_s3(filename, bucket_name, key):
    with open(filename, 'rb') as f:  # read in binary mode
        boto3.Session().resource('s3').Bucket(bucket_name).Object(key).upload_fileobj(f)

# Simply call write_to_s3 with the required arguments:

write_to_s3('file_name.csv',
            'your-bucket-name',
            'file_name.csv')
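For reading, a matching download helper is the mirror image (a minimal sketch reusing the boto3 import above; read_from_s3 and the file/bucket names are placeholders):

def read_from_s3(filename, bucket_name, key):
    # Download the S3 object identified by key into a local file
    with open(filename, 'wb') as f:  # write in binary mode
        boto3.Session().resource('s3').Bucket(bucket_name).Object(key).download_fileobj(f)

read_from_s3('file_name.csv',
             'your-bucket-name',
             'file_name.csv')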

Upvotes: 0

openrory

Reputation: 71

In pandas 1.0.5, if you've already given the notebook instance access, reading a CSV from S3 is as easy as this (https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#reading-remote-files):

df = pd.read_csv('s3://<bucket-name>/<filepath>.csv')

During the notebook setup process I attached a SageMakerFullAccess policy to the notebook instance, granting it access to the S3 bucket. You can also do this via the IAM Management Console.
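To check which IAM role the notebook instance is actually running with (a quick sketch using the SageMaker SDK that is preinstalled on notebook instances):

from sagemaker import get_execution_role

# Prints the ARN of the IAM role attached to this notebook instance
print(get_execution_role())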

If you need to provide credentials, there are three ways to do so (https://s3fs.readthedocs.io/en/latest/#credentials); a small environment-variable example follows the list:

  • the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN environment variables
  • configuration files such as ~/.aws/credentials
  • for nodes on EC2, the IAM metadata provider
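A minimal sketch of the environment-variable route (the key values, bucket name, and file path are placeholders):

import os
import pandas as pd

# Placeholder values; use the keys of an IAM user that can read the bucket
os.environ['AWS_ACCESS_KEY_ID'] = '<your-access-key-id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<your-secret-access-key>'

df = pd.read_csv('s3://<bucket-name>/<filepath>.csv')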

Upvotes: 2

Gili Nachum

Reputation: 5568

You're trying to use Pandas to read files from S3. Pandas can read files from your local disk, but not directly from S3.
Instead, download the files from S3 to your local disk, then use Pandas to read them.

import boto3
import botocore

BUCKET_NAME = 'my-bucket' # replace with your bucket name
KEY = 'my_image_in_s3.jpg' # replace with your object key

s3 = boto3.resource('s3')

try:
    # download as local file
    s3.Bucket(BUCKET_NAME).download_file(KEY, 'my_local_image.jpg')

    # OR read directly to memory as bytes:
    # bytes = s3.Object(BUCKET_NAME, KEY).get()['Body'].read() 
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise
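Applied to the question's data, the same approach would look like this (a sketch reusing the s3 resource above; the bucket and file names are taken from the question):

import pandas as pd

# Download the CSV from S3 to local disk, then read it with pandas
s3.Bucket('atwinebankloadrisk').download_file('application_test.csv', 'application_test.csv')
df_test = pd.read_csv('application_test.csv')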

Upvotes: 3

jmao

Reputation: 100

You can load S3 data into an AWS SageMaker notebook using the sample code below. Make sure the Amazon SageMaker role has a policy attached that grants it access to S3.

[1] https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html

import boto3 
import botocore 
import pandas as pd 
from sagemaker import get_execution_role 

role = get_execution_role() 

bucket = 'Your_bucket_name' 
data_key = 'your_data_file.csv'
data_location = 's3://{}/{}'.format(bucket, data_key) 

pd.read_csv(data_location) 

Upvotes: 6

dennis-w

Reputation: 2156

You can use s3fs (https://s3fs.readthedocs.io/en/latest/) to read S3 files directly with pandas. The code below is taken from here:

import os
import pandas as pd
from s3fs.core import S3FileSystem

os.environ['AWS_CONFIG_FILE'] = 'aws_config.ini'

s3 = S3FileSystem(anon=False)
key = 'path/to/your-csv.csv'
bucket = 'your-bucket-name'

df = pd.read_csv(s3.open('{}/{}'.format(bucket, key), mode='rb'))

Upvotes: 1
