Reputation: 4543
Is there any way to load/read an external file (e.g. from AWS S3) in numpy? I have several npy files stored in S3. I have tried to access them through an S3 presigned URL, but it seems that neither numpy.load nor np.genfromtxt is able to read them.
I would rather not save the files to the local file system and then load them into numpy.
Any ideas?
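Something in-memory along these lines would be acceptable, if it can be made to work (a rough sketch; presigned_url is a placeholder for a generated presigned URL):
import io
import urllib.request
import numpy as np

# presigned_url is a placeholder for an S3 presigned URL
with urllib.request.urlopen(presigned_url) as resp:
    buf = io.BytesIO(resp.read())  # np.load needs a seekable file-like object
arr = np.load(buf)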
Upvotes: 9
Views: 11778
Reputation: 419
I've compared s3fs and io.BytesIO for loading a 28 GB npz file from S3: s3fs took 30 min, while the io.BytesIO approach took 12 min.
import io
import boto3
import numpy as np

obj = boto3.resource("s3").Object(bucket, key)  # bucket and key defined elsewhere
with io.BytesIO(obj.get()["Body"].read()) as f:  # one bulk GET straight into memory
    f.seek(0)  # rewind the buffer before loading
    X, y = np.load(f).values()
from s3fs.core import S3FileSystem

fs = S3FileSystem()  # avoid naming the instance s3fs, which shadows the module
with fs.open(f"s3://{bucket}/{key}") as s3file:  # s3fs reads the object in chunks over the network
    X, y = np.load(s3file).values()
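For very large files, boto3's managed transfer can also stream the object into an in-memory buffer using parallel multipart downloads. A sketch of that variant (same bucket and key as above, not benchmarked here):
import io
import boto3
import numpy as np

buf = io.BytesIO()
# download_fileobj performs a multithreaded, multipart download for large objects
boto3.client("s3").download_fileobj(bucket, key, buf)
buf.seek(0)  # rewind before handing the buffer to numpy
X, y = np.load(buf).values()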
Upvotes: 7
Reputation: 936
Using s3fs
import numpy as np
from s3fs.core import S3FileSystem

s3 = S3FileSystem()
key = 'your_file.npy'
bucket = 'your_bucket'

arr = np.load(s3.open(f'{bucket}/{key}'))  # s3.open returns a readable file-like object
You might have to pass allow_pickle=True to np.load, depending on how the file was saved.
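For example, if the array was saved with Python objects in it (same s3 filesystem object and path as above):
arr = np.load(s3.open(f'{bucket}/{key}'), allow_pickle=True)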
Upvotes: 13
Reputation: 920
I had success using boto and StringIO (this was on Python 2, with the legacy boto library). Connect to S3 using boto and get your bucket, then read the file into numpy with the following code:
import numpy as np
from StringIO import StringIO  # Python 2 only

key = bucket.get_key('YOUR_KEY')  # bucket obtained via boto, as described above
data_string = StringIO(key.get_contents_as_string())  # wrap the raw bytes in a file-like object
data = np.load(data_string)
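On Python 3 the StringIO module is gone and np.load needs a bytes buffer, so io.BytesIO is the drop-in replacement. A sketch, assuming the same legacy boto key object as above:
import io
import numpy as np

data = np.load(io.BytesIO(key.get_contents_as_string()))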
I am not sure it's the most efficient way, but it doesn't require a public URL.
Cheers, Michael
Upvotes: 1