martina.physics

Reputation: 9804

Boto3 read a file content from S3 key line by line

With boto3, you can read a file's content from a location in S3, given a bucket name and the key, as follows (this assumes a preliminary import boto3):

s3 = boto3.resource('s3')

content = s3.Object(BUCKET_NAME, S3_KEY).get()['Body'].read()

This returns the whole content as a single string (bytes in Python 3). The specific file I need to fetch happens to be a collection of dictionary-like objects, one per line, so the file as a whole is not valid JSON. Instead of reading it all at once, I'd like to stream it as a file object and read it line by line; I cannot find a way to do this other than downloading the file locally first, as in

s3 = boto3.resource('s3')

bucket = s3.Bucket(BUCKET_NAME)

filename = 'my-file'
bucket.download_file(S3_KEY, filename)

f = open(filename)

Is it possible to have this kind of control over the file without having to download it locally first?

Upvotes: 10

Views: 21103

Answers (6)

EnzoMolion

Reputation: 1047

You can also take advantage of StreamingBody's iter_lines method:

for line in s3.Object(bucket, file).get()['Body'].iter_lines():
    decoded_line = line.decode('utf-8')  # if decoding is needed

That consumes less memory than reading the whole file at once and then splitting it.
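
A minimal, self-contained version of this approach (the bucket and key names below are placeholders, not from the original post):

import boto3

s3 = boto3.resource('s3')

# iter_lines streams the body and yields one line at a time,
# so the whole object is never held in memory.
for line in s3.Object('my-bucket', 'my-key').get()['Body'].iter_lines():
    print(line.decode('utf-8'))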

Upvotes: 6

Ira Re

Reputation: 800

You can now use the download_fileobj function. Here is an example for a CSV file:

import boto3
import csv

bucket_name      = 'my_bucket'
file_key         = 'my_key/file.csv'
output_file_path = 'output.csv'

s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)

# Dump the binary content to a local file in append mode
with open(output_file_path, 'ab') as file_object:
    bucket.download_fileobj(
        Key=file_key,
        Fileobj=file_object,
    )

# Read your file as usual
with open(output_file_path, 'r') as csvfile:
    lines = csv.reader(csvfile)
    for line in lines:
        doWhatEver(line[0])  # placeholder for your own processing
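
Note that Fileobj can be any file-like object, so you can skip the local file entirely by downloading into an in-memory buffer. A sketch, reusing the same hypothetical bucket and key names:

import io
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('my_bucket')

buffer = io.BytesIO()
bucket.download_fileobj(Key='my_key/file.csv', Fileobj=buffer)
buffer.seek(0)  # rewind before reading

for line in buffer.read().decode('utf-8').splitlines():
    print(line)

This still holds the whole object in memory, but it avoids the intermediate file on disk.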

Upvotes: 0

amcleod83

Reputation: 107

I found .splitlines() worked for me...

txt_file = s3.Object(bucket, file).get()['Body'].read().decode('utf-8').splitlines()

Without .splitlines(), the whole blob of text was returned, and trying to iterate over it yielded each character rather than each line. With .splitlines(), iterating line by line was achievable.

In my example here I iterate through each line and split it into a list of fields.

txt_file = s3.Object(bucket, file).get()['Body'].read().decode(
        'utf-8').splitlines()

for line in txt_file:
    arr = line.split()
    print(arr)
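
Since the original question involves dictionary-like objects, one per line, each line could then be parsed individually. A sketch using ast.literal_eval, assuming the lines are Python-literal dicts rather than JSON (bucket and key names are placeholders):

import ast
import boto3

s3 = boto3.resource('s3')
body = s3.Object('my-bucket', 'my-key').get()['Body']

# Each line becomes a dict, collected into a list of records.
records = [ast.literal_eval(line) for line in body.read().decode('utf-8').splitlines()]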

Upvotes: 7

Christophe

Reputation: 2012

The following comment from kooshiwoosh to a similar question provides a nice answer:

from io import TextIOWrapper
from gzip import GzipFile
...

# get StreamingBody from botocore.response
response = s3.get_object(Bucket=bucket, Key=key)
# if gzipped
gzipped = GzipFile(None, 'rb', fileobj=response['Body'])
data = TextIOWrapper(gzipped)

for line in data:
    print(line)  # process each line of decoded text
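
If the object is not gzipped, a similar pattern works by wrapping the raw stream directly. A sketch using codecs.getreader, with placeholder bucket and key names:

import codecs
import boto3

s3 = boto3.client('s3')
response = s3.get_object(Bucket='my-bucket', Key='my-key')

# codecs.getreader decodes the StreamingBody on the fly and is iterable
# line by line, so the object is never read into memory all at once.
for line in codecs.getreader('utf-8')(response['Body']):
    print(line)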

Upvotes: 1

Harry

Reputation: 1091

This works for me:

import json

json_object = s3.get_object(Bucket=bucket, Key=json_file_name)
json_file_reader = json_object['Body'].read()
content = json.loads(json_file_reader)
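
Note that this parses the whole object as a single JSON document. If the file instead has one JSON object per line (closer to what the question describes), each line can be parsed separately; a sketch with placeholder names:

import json
import boto3

s3 = boto3.client('s3')
response = s3.get_object(Bucket='my-bucket', Key='my-key')

# Parse one JSON document per line while streaming the body.
records = [json.loads(line) for line in response['Body'].iter_lines()]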

Upvotes: 0

Hezi Halpert

Reputation: 9

This reads only a fixed number of bytes from the start of the object, which helps if you don't need the whole file:

bytes_to_read = 512

content = s3.Object(BUCKET_NAME, S3_KEY).get()['Body'].read(bytes_to_read)
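
Building on that, read(amt) can be called repeatedly to stream the object in fixed-size chunks without holding it all in memory. A sketch with placeholder names, where process is a hypothetical handler:

import boto3

s3 = boto3.resource('s3')
body = s3.Object('my-bucket', 'my-key').get()['Body']

# Stream the object in 512-byte chunks until it is exhausted.
while True:
    chunk = body.read(512)
    if not chunk:
        break
    process(chunk)  # hypothetical handler for each chunk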

Upvotes: 0
