Reputation: 9804
With boto3, you can read a file's content from a location in S3, given a bucket name and the key, as follows (this assumes a preliminary import boto3):
s3 = boto3.resource('s3')
content = s3.Object(BUCKET_NAME, S3_KEY).get()['Body'].read()
This returns the whole content as a single string (a bytes object in Python 3). The specific file I need to fetch happens to be a collection of dictionary-like objects, one per line, so it is not valid JSON as a whole. Instead of reading it as one string, I'd like to stream it as a file object and read it line by line; I cannot find a way to do this other than downloading the file locally first:
s3 = boto3.resource('s3')
bucket = s3.Bucket(BUCKET_NAME)
filename = 'my-file'
bucket.download_file(S3_KEY, filename)
f = open('my-file')
What I'm asking is if it's possible to have this type of control on the file without having to download it locally first?
Upvotes: 10
Views: 21103
Reputation: 1047
You can also take advantage of StreamingBody's iter_lines method:
for line in s3.Object(bucket, file).get()['Body'].iter_lines():
    decoded_line = line.decode('utf-8')  # if decoding is needed
That consumes less memory than reading the whole file at once and then splitting it into lines.
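As a minimal sketch of how this fits the question's "one dictionary-like object per line" file: the generator below decodes and parses each line lazily. It assumes every line is a valid Python literal (so ast.literal_eval can parse it), and it uses an io.BytesIO as a stand-in for the StreamingBody so it runs without S3. The name parse_dict_lines is hypothetical.

```python
import ast
import io


def parse_dict_lines(lines, encoding="utf-8"):
    """Lazily yield one dict per non-empty line from an iterable of bytes lines."""
    for raw in lines:
        text = raw.decode(encoding).strip()
        if not text:
            continue  # skip blank lines
        # literal_eval only accepts Python literals, so it is safer than eval().
        yield ast.literal_eval(text)


# Stand-in for s3.Object(bucket, key).get()['Body'].iter_lines():
# iterating a BytesIO also yields one bytes line at a time.
fake_body = io.BytesIO(b"{'id': 1, 'name': 'a'}\n\n{'id': 2, 'name': 'b'}\n")
records = list(parse_dict_lines(fake_body))
print(records)
```

With a real object you would pass the iter_lines() iterator in place of fake_body; only one line is held in memory at a time until you collect the results.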
Upvotes: 6
Reputation: 800
You can now also use the download_fileobj function. Here is an example for a CSV file:
import boto3
import csv
bucket = 'my_bucket'
file_key = 'my_key/file.csv'
output_file_path = 'output.csv'
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket)
# Dump the object as binary; 'wb' overwrites, so a re-run does not append a second copy
with open(output_file_path, 'wb') as file_object:
    bucket.download_fileobj(
        Key=file_key,
        Fileobj=file_object,
    )
# Read your file as usual
with open(output_file_path, 'r', newline='') as csvfile:
    lines = csv.reader(csvfile)
    for line in lines:
        doWhatEver(line[0])  # placeholder for your own per-row processing
Upvotes: 0
Reputation: 107
I found .splitlines() worked for me...
txt_file = s3.Object(bucket, file).get()['Body'].read().decode('utf-8').splitlines()
Without the .splitlines() call, the whole blob of text was returned, and trying to iterate over it line by line resulted in each character being iterated instead. With .splitlines(), iteration by line was achievable.
In my example here I iterate through each line and split it into a list of fields, which can then be compiled into a dict.
txt_file = s3.Object(bucket, file).get()['Body'].read().decode(
    'utf-8').splitlines()
for line in txt_file:
    arr = line.split()
    print(arr)
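To illustrate the difference described above without touching S3, here is a small sketch on a plain string. The "first field is the key, the rest are its values" layout is a hypothetical assumption for the example, not something the original file format is known to follow.

```python
text = "alpha 1 2\nbeta 3 4\n"

# Iterating a str directly yields single characters ...
first_chars = list(text)[:3]
# ... while splitlines() yields whole lines without the trailing newline.
lines = text.splitlines()

# Hypothetical layout: the first field is the key, the rest are its values.
table = {}
for line in lines:
    key, *values = line.split()
    table[key] = values

print(first_chars)
print(lines)
print(table)
```

Note that read() still pulls the entire object into memory before splitting, so this suits files that comfortably fit in RAM.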
Upvotes: 7
Reputation: 2012
The following comment from kooshiwoosh to a similar question provides a nice answer:
from io import TextIOWrapper
from gzip import GzipFile
...
# get StreamingBody from botocore.response (s3 here is a client, e.g. boto3.client('s3'))
response = s3.get_object(Bucket=bucket, Key=key)
# if gzipped
gzipped = GzipFile(None, 'rb', fileobj=response['Body'])
data = TextIOWrapper(gzipped)
for line in data:
    # process line
    pass
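The same wrapping can be tried locally. In this sketch an io.BytesIO holding gzip-compressed bytes stands in for response['Body'] (an assumption made so the example runs without S3); the GzipFile and TextIOWrapper calls are the same ones used above, with an explicit encoding added.

```python
import gzip
import io
from gzip import GzipFile
from io import TextIOWrapper

# Stand-in for response['Body']: any binary file-like object with read().
payload = gzip.compress("first line\nsecond line\n".encode("utf-8"))
fake_body = io.BytesIO(payload)

# Decompress on the fly, then decode bytes to text line by line.
gzipped = GzipFile(None, "rb", fileobj=fake_body)
data = TextIOWrapper(gzipped, encoding="utf-8")
decoded_lines = [line.rstrip("\n") for line in data]
print(decoded_lines)
```

Passing encoding explicitly avoids depending on the platform's default text encoding.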
Upvotes: 1
Reputation: 1091
This works for me:
import json
# s3 here is a client, e.g. boto3.client('s3')
json_object = s3.get_object(Bucket=bucket, Key=json_file_name)
json_file_reader = json_object['Body'].read()
content = json.loads(json_file_reader)
Upvotes: 0
Reputation: 9
This reads at most the first 512 bytes of the object instead of the whole body:
bytes_to_read = 512
content = s3.Object(BUCKET_NAME, S3_KEY).get()['Body'].read(bytes_to_read)
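To go beyond the first chunk, read() can be called repeatedly on the same body until it returns an empty bytes object. The sketch below shows that loop; iter_chunks here is a hypothetical helper, and an io.BytesIO stands in for the StreamingBody so the example runs without S3.

```python
import io
from functools import partial


def iter_chunks(body, chunk_size=512):
    """Return an iterator over successive chunks; it stops when read() returns b''."""
    return iter(partial(body.read, chunk_size), b"")


# Stand-in for s3.Object(BUCKET_NAME, S3_KEY).get()['Body'].
fake_body = io.BytesIO(b"x" * 1300)
sizes = [len(chunk) for chunk in iter_chunks(fake_body)]
print(sizes)
```

Keep in mind that a chunk boundary can fall in the middle of a line, so fixed-size chunks need extra buffering if you want whole lines.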
Upvotes: 0