Reputation: 7319
I created an S3 bucket and placed both a data.csv
and a data.json
file inside it. I then created a SageMaker notebook and granted its IAM role access to this S3 bucket.
This now works from inside the notebook:
import pandas as pd
from sagemaker import get_execution_role
bucket='my-sagemaker-bucket'
data_key = 'data.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)
data = pd.read_csv(data_location)
But this errors, saying the file doesn't exist:
import json
from sagemaker import get_execution_role
bucket='my-sagemaker-bucket'
data_key = 'data.json'
data_location = 's3://{}/{}'.format(bucket, data_key)
data = json.load(open(data_location))
Anyone know why I can read the CSV but not the JSON? I also can't shutil.copy
the CSV to the notebook's current working directory (it also says the file doesn't exist). I'm not very well versed with S3 buckets or SageMaker, so I'm not sure if this is a permissions/policy issue or something else.
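The failing call reduces to this minimal sketch (same bucket name as above), which reproduces the same error even outside SageMaker — the built-in open() only resolves local filesystem paths, so it treats the s3:// URL as a relative path:

```python
import json

# open() only understands local filesystem paths, so an s3:// URL
# is treated as a (nonexistent) relative path and raises FileNotFoundError.
try:
    json.load(open('s3://my-sagemaker-bucket/data.json'))
except FileNotFoundError as exc:
    print('open() cannot read S3 URLs:', exc)
```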
Upvotes: 0
Views: 4516
Reputation: 222
Pandas can handle S3 URLs using your AWS credentials, so you could use pd.read_csv
or pd.read_json
instead of json.load
. The suggestion from @Michael_S should work.
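As a quick local illustration (an in-memory buffer standing in for the S3 object, since no bucket is available here), pd.read_json parses the same content that json.load would need you to open by hand:

```python
import io
import json
import pandas as pd

# pd.read_json accepts the same kinds of sources as pd.read_csv,
# including s3:// URLs when AWS credentials are configured.
# Shown here with an in-memory buffer instead of a real bucket.
payload = '{"a": [1, 2], "b": [3, 4]}'

df = pd.read_json(io.StringIO(payload))  # an s3://... URL works the same way
parsed = json.loads(payload)             # json.loads needs the raw text itself

print(df.shape)     # (2, 2)
print(parsed['a'])  # [1, 2]
```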
Upvotes: 1
Reputation: 35
Your SageMaker execution role might have insufficient rights to access your S3 bucket. The default SageMaker execution role has the "AmazonSageMakerFullAccess" policy, whose S3 permissions are scoped by the request condition "s3:ExistingObjectTag/SageMaker" = "true".
So maybe you could try simply tagging your S3 objects (tag key SageMaker, value true). Check your IAM settings.
import pandas as pd
bucket='my-sagemaker-bucket'
data_key = 'data.json'
data_location = 's3://{}/{}'.format(bucket, data_key)
pd.read_json(data_location)  # optionally: orient='columns', typ='series'
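A note on the commented-out arguments: typ='series' makes read_json return a Series instead of a DataFrame. A small sketch with an in-memory buffer standing in for the S3 path:

```python
import io
import pandas as pd

# With typ='series', a top-level JSON object maps to a pandas Series:
# the object's keys become the index, its values the data.
buf = io.StringIO('{"x": 1, "y": 2}')
s = pd.read_json(buf, typ='series')
print(s['x'], s['y'])
```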
Upvotes: 2