Mark_Anderson

Reputation: 1324

Error in joblib.load when reading file from s3

When I try to read a file from S3 with joblib.load(), I get ValueError: embedded null byte.

The files were created by joblib and load successfully from local copies (made before uploading to S3), so the error presumably lies in how the file is stored on or retrieved from S3.

Minimal code:

# Imports (AWS credentials assumed)
import boto3
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23


s3 = boto3.resource('s3')
bucket_str = "my-aws-bucket"
bucket_key = "some-pseudo/folder-set/my-filename.joblib"
# This line raises ValueError: embedded null byte
joblib.load(s3.Bucket(bucket_str).Object(bucket_key).get()['Body'].read())

Upvotes: 9

Views: 6999

Answers (3)

Jelmer Wind

Reputation: 431

You can do it like this using the s3fs package.

import joblib
import s3fs

fs = s3fs.S3FileSystem() # Updated method name
filename = "s3://<bucket_name>/<path_to>/<my_file>.joblib"
# joblib files are binary, so open in 'rb' mode without a text encoding
with fs.open(filename, 'rb') as fh:
    data = joblib.load(fh)

I guess everybody has their own preference, but I really like s3fs because it makes the code look very familiar to people who haven't worked with S3 before.

Upvotes: 1

jtlz2

Reputation: 8407

@Jelmer Wind's answer is great but contains some errors. Here is an updated version that was a bit long for a comment:

import joblib
import s3fs

fs = s3fs.S3FileSystem() # Updated method name
filename = "s3://<bucket_name>/<path_to>/<my_file>.joblib"
# joblib files are binary, so open in 'rb' mode without a text encoding
with fs.open(filename, 'rb') as fh:
    data = joblib.load(fh)

Upvotes: 1

Mark_Anderson

Reputation: 1324

The following code downloads the file into an in-memory buffer before passing it to joblib.load(), which then loads it successfully.

from io import BytesIO
import boto3
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

s3 = boto3.resource('s3')
bucket_str = "my-aws-bucket"
bucket_key = "some-pseudo/folder-set/my-filename.joblib"
with BytesIO() as data:
    # stream the S3 object into the in-memory buffer
    s3.Bucket(bucket_str).download_fileobj(bucket_key, data)
    data.seek(0)    # move back to the beginning after writing
    df = joblib.load(data)    # joblib reads from the file-like object

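I originally assumed that something in how boto3 chunks files for download creates a null byte that breaks joblib. A more likely explanation is that joblib.load() expects a filename or a file-like object, not raw bytes: the bytes returned by .read() get treated as a path, and serialized data normally contains null bytes, which are not allowed in a path. Wrapping the data in BytesIO gives joblib.load() a file-like object to read from instead.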
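Here is a minimal sketch, with no S3 involved, of where the null byte can come from. It assumes pickle as a stand-in for joblib (joblib files are pickle-based) and that a loader handed raw bytes passes them to open() as if they were a path, which is what joblib.load() appears to do with anything that has no read() method. The function names are made up for illustration.

```python
import pickle
from io import BytesIO

# Stand-in for the bytes returned by Body.read(): any pickled object will do.
payload = pickle.dumps({"weights": [0.0, 1.0, 2.0]})


def load_like_a_path(raw):
    """Treat the raw bytes as a filename, as a path-based loader would.

    Returns the loaded object, or the ValueError if open() rejects the "path".
    """
    try:
        with open(raw, "rb") as fh:
            return pickle.load(fh)
    except ValueError as err:
        return err


def load_like_a_file(raw):
    """Wrap the raw bytes in a file-like object before loading."""
    return pickle.load(BytesIO(raw))


print(b"\x00" in payload)               # the serialized data contains null bytes
print(repr(load_like_a_path(payload)))  # open() refuses the bytes as a path
print(load_like_a_file(payload))        # the BytesIO wrapper loads fine
```

If this is right, the download itself is not corrupting anything: the same bytes fail or succeed depending only on whether they are handed over as a path or as a file-like object.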

PS. With this method the file never touches the local disk, which is helpful under some circumstances (e.g. a node with plenty of RAM but very little disk space).
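If you need this in more than one place, the in-memory pattern can be wrapped in a small helper. This is only a sketch: load_from_bucket and InMemoryBucket are hypothetical names, the helper assumes the bucket object exposes download_fileobj(key, fileobj) the way a boto3 Bucket does, and pickle.load is used as the default loader so the example runs without AWS access.

```python
import pickle
from io import BytesIO


def load_from_bucket(bucket, key, loader=pickle.load):
    """Download `key` from `bucket` into memory and deserialize it with `loader`."""
    with BytesIO() as buffer:
        bucket.download_fileobj(key, buffer)
        buffer.seek(0)  # rewind so the loader reads from the start
        return loader(buffer)


class InMemoryBucket:
    """Hypothetical stand-in for a boto3 Bucket, for trying the helper offline."""

    def __init__(self, objects):
        self._objects = objects  # maps key -> raw bytes

    def download_fileobj(self, key, fileobj):
        fileobj.write(self._objects[key])


fake_bucket = InMemoryBucket({"models/demo.pkl": pickle.dumps([1, 2, 3])})
result = load_from_bucket(fake_bucket, "models/demo.pkl")
print(result)
```

Against real S3 the call would presumably look like load_from_bucket(boto3.resource('s3').Bucket(bucket_str), bucket_key, loader=joblib.load).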

Upvotes: 13
