Austin

Reputation: 7319

Amazon SageMaker: open JSON from S3 bucket

I created an S3 bucket and placed both a data.csv and a data.json file inside it. I then created a SageMaker notebook and specified this S3 bucket in the IAM role.

This now works from inside the notebook:

import pandas as pd
from sagemaker import get_execution_role

bucket='my-sagemaker-bucket'
data_key = 'data.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
data = pd.read_csv(data_location)

But this errors, saying the file doesn't exist:

import json
from sagemaker import get_execution_role

bucket='my-sagemaker-bucket'
data_key = 'data.json'
data_location = 's3://{}/{}'.format(bucket, data_key)
data = json.load(open(data_location))

Does anyone know why I can read the CSV but not the JSON? I also can't shutil.copy the CSV to the notebook's current working directory (that also says the file doesn't exist). I'm not very well versed in S3 buckets or SageMaker, so I'm not sure whether this is a permissions/policy issue or something else.

Upvotes: 0

Views: 4516

Answers (2)

Han

Reputation: 222

Pandas can handle S3 URLs using your AWS credentials, so you can use pd.read_csv or pd.read_json instead of json.load. The suggestion from @Michael_S should work.
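
To clarify why the original snippet fails: open() only resolves paths on the local filesystem, so it treats the 's3://...' URL as a nonexistent local file. If you'd rather keep using the json module than pandas, a minimal sketch with boto3 (reusing the bucket and key names from the question) could look like this:

    import json
    
    import boto3
    
    bucket = 'my-sagemaker-bucket'   # name from the question
    data_key = 'data.json'
    
    # open() only reads local files, so fetch the object through the
    # S3 API first, then parse the returned bytes with the json module.
    s3 = boto3.client('s3')
    obj = s3.get_object(Bucket=bucket, Key=data_key)
    data = json.loads(obj['Body'].read().decode('utf-8'))

When given an s3:// URL, pd.read_json performs an equivalent fetch for you under the hood.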

Upvotes: 1

Michael_S

Reputation: 35

Your SageMaker execution role might have insufficient rights to access your S3 bucket. The default SageMaker execution role has the "AmazonSageMakerFullAccess" policy attached, which uses the S3 request condition "s3:ExistingObjectTag/SageMaker = true".

So you could try simply tagging your data in S3 (tag: SageMaker = true); since the condition key is s3:ExistingObjectTag, the tag belongs on the objects rather than the bucket. Also check your IAM settings.
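
If tagging through the console is inconvenient, a minimal sketch of applying the tag with boto3 (bucket and key names are taken from the question):

    import boto3
    
    s3 = boto3.client('s3')
    
    # s3:ExistingObjectTag/SageMaker = true inspects tags on the object
    # itself, so tag each object the notebook needs to read.
    for key in ['data.csv', 'data.json']:
        s3.put_object_tagging(
            Bucket='my-sagemaker-bucket',
            Key=key,
            Tagging={'TagSet': [{'Key': 'SageMaker', 'Value': 'true'}]},
        )

Once the permissions line up, reading the file with pandas should work directly: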

import pandas as pd

bucket='my-sagemaker-bucket'
data_key = 'data.json'
data_location = 's3://{}/{}'.format(bucket, data_key)

pd.read_json(data_location)  # optionally pass orient='columns', typ='series'

Upvotes: 2
