Reputation: 117
I'm trying to read a file with pandas from an s3 bucket without downloading the file to the disk. I've tried to use boto3 for that as
import io

import boto3
import pandas as pd

s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucket_name', Key='key')
read_file = io.BytesIO(obj['Body'].read())
df = pd.read_csv(read_file)
And also I've tried s3fs as
import s3fs
import pandas as pd
fs = s3fs.S3FileSystem(anon=False)
with fs.open('bucket_name/path/to/file.csv', 'rb') as f:
    df = pd.read_csv(f)
The issue is that it takes too long to read the file: about 3 minutes for a 38 MB file. Is it supposed to be like that? If so, is there a faster way to do the same thing? If not, any suggestions as to what might be causing the issue?
Thanks!
Upvotes: 4
Views: 7335
Reputation: 76
Based on this answer to a similar issue, you might want to check which region the bucket you're reading from is in, compared to where you're reading it from. Moving the bucket closer (assuming you have control over its location) could be a simple change that improves performance drastically.
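As a starting point, you can look up a bucket's region with boto3's get_bucket_location call. A small sketch (the bucket_region helper and the optional s3_client parameter are my own additions, not part of your code):

```python
def bucket_region(bucket_name, s3_client=None):
    """Return the AWS region a bucket lives in.

    Accepts an optional pre-built client so the lookup
    logic can be exercised without real AWS credentials.
    """
    if s3_client is None:
        import boto3  # only create a real client when none is supplied
        s3_client = boto3.client('s3')
    resp = s3_client.get_bucket_location(Bucket=bucket_name)
    # S3 reports None (no LocationConstraint) for the legacy us-east-1 region
    return resp.get('LocationConstraint') or 'us-east-1'
```

If the region it prints differs from the region you're running in (e.g. your EC2 instance or local machine's nearest region), cross-region transfer is a likely cause of the slow reads.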
Upvotes: 3