Hyruma92

Reputation: 876

Read multi-object JSON gz file from S3 in Python

I have some files in an S3 bucket and I'm trying to read them in the fastest possible way. Each file is gzip-compressed and contains a single multi-object JSON file (one object per line) like this:

{"id":"test1", "created":"2020-01-01", "lastUpdated":"2020-01-01T00:00:00.000Z"}
{"id":"test2", "created":"2020-01-01", "lastUpdated":"2020-01-01T00:00:00.000Z"}

What I want to do is load the JSON file, read every single object, and process it. After some research, this is the only code that worked for me:

import json
import gzip
import boto3
from io import BytesIO

s3 = boto3.resource('s3')
bucket = s3.Bucket("my-bucket")

for obj in bucket.objects.filter(Prefix='my-prefix').all():
    # download the whole compressed object into memory
    buffer = BytesIO(obj.get()['Body'].read())
    gzipfile = gzip.GzipFile(fileobj=buffer)
    # each line of the decompressed file is one JSON object
    for line in gzipfile:
        json_object = json.loads(line)
        # some stuff with the json_object

Does anyone know a better way to read the JSON objects?

Thanks for helping

Upvotes: 4

Views: 7089

Answers (2)

ekmcd

Reputation: 182

After you have the buffer, try the following:

# buffer holds the compressed bytes from the question's code
decompressed = gzip.decompress(buffer.getvalue())
# the content is newline-delimited JSON, so parse one object per line
for line in decompressed.splitlines():
    json_obj = json.loads(line)
    # Do stuff
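
Note that gzip.decompress keeps both the compressed and the decompressed payload in memory at once. A minimal streaming sketch, reusing the question's placeholder bucket and prefix names, would decompress lines as they are read:

import json
import gzip
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket("my-bucket")  # placeholder names from the question

for obj in bucket.objects.filter(Prefix='my-prefix'):
    # GzipFile only needs a .read() method, which the boto3 streaming
    # body provides, so the object is never fully buffered up front
    with gzip.GzipFile(fileobj=obj.get()['Body']) as gz:
        for line in gz:
            json_obj = json.loads(line)
            # Do stuff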

Upvotes: 0

Hyruma92

Reputation: 876

After some research, I found the smart_open library very useful and simple to use.

from smart_open import open
import json
import boto3

s3_client = boto3.client("s3")
source_uri = 's3://my-bucket/my-path'
for json_line in open(source_uri, transport_params={"client": s3_client}):
    my_json = json.loads(json_line)

It reads the object as a stream, so you don't need to keep the entire file in memory. Furthermore, it handles different extensions, so I don't need to care about the gz decompression myself.
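
Since the question iterates over every object under a prefix, boto3's listing can be combined with smart_open. A minimal sketch, reusing the question's placeholder bucket and prefix names:

import json
import boto3
from smart_open import open

s3 = boto3.resource('s3')
s3_client = boto3.client('s3')
bucket = s3.Bucket('my-bucket')  # placeholder names from the question

for obj in bucket.objects.filter(Prefix='my-prefix'):
    # smart_open infers gzip decompression from the .gz extension
    # and streams each object instead of downloading it all at once
    for json_line in open(f's3://my-bucket/{obj.key}',
                          transport_params={"client": s3_client}):
        my_json = json.loads(json_line)
        # some stuff with my_json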

Upvotes: 2
