Kshitij Marwah

Reputation: 1151

Reading the contents of a gzip file from AWS S3 in Python

I am trying to read some logs from a Hadoop process that I run in AWS. The logs are stored in an S3 folder and have the following path.

bucketname = name
key = y/z/stderr.gz

Here y is the cluster id and z is a folder name. Both of these act as folders (objects) in AWS, so the full path is like x/y/z/stderr.gz.

Now I want to unzip this .gz file and read its contents. I don't want to download the file to my system; I want to keep the contents in a Python variable.

This is what I have tried till now.

import boto3

s3 = boto3.resource('s3')
bucket_name = "name"
key = "y/z/stderr.gz"
obj = s3.Object(bucket_name, key)
n = obj.get()['Body'].read()

This gives me output that is not readable. I also tried

n = obj.get()['Body'].read().decode('utf-8')

which gives the error 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte.

I have also tried

gzip = StringIO(obj)
gzipfile = gzip.GzipFile(fileobj=gzip)
content = gzipfile.read()

This returns an error IOError: Not a gzipped file

Not sure how to decode this .gz file.
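
For context, the 0x8b in the decode error is the second byte of the gzip magic number 0x1f 0x8b, so the body is compressed bytes rather than UTF-8 text and must be decompressed before decoding. A quick sanity check on n from above:

# a gzip stream always starts with the two magic bytes 0x1f 0x8b
print(n[:2] == b'\x1f\x8b')  # True means the payload needs decompressing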

Edit: Found a solution. I needed to pass n into BytesIO, under a new name, since gzip is already taken by the module:

bytestream = BytesIO(n)

Upvotes: 45

Views: 71484

Answers (8)

Zac

Reputation: 720

Here is my way to read a gzipped CSV file from S3:

import boto3
import gzip
import csv

s3 = boto3.client('s3')

response = s3.get_object(Bucket=bucket, Key=key)

# body is a StreamingBody object
s3_stream = response["Body"]

# open it in text mode
with gzip.open(s3_stream, mode='rt') as gz_file:
    reader = csv.reader(gz_file)

    # Iterate through the CSV rows
    for row in reader:
        ...
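
If the CSV has a header row, csv.DictReader can be dropped in the same way. A sketch, assuming a fresh s3_stream from get_object (a StreamingBody can only be read once):

with gzip.open(s3_stream, mode='rt') as gz_file:
    # DictReader yields one dict per data row, keyed by the header line
    for row in csv.DictReader(gz_file):
        print(row)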

Upvotes: 1

greenwd

Reputation: 303

I was also stuck reading the contents of gzipped CSV files from S3 and got the same errors, but finally found a way: wrap the body in a gzip.GzipFile and iterate through its rows with csv.reader:

import csv
import gzip
from io import StringIO

for obj in bucket.objects.filter(Prefix=folder_prefix):
    if obj.key.endswith(".gz"):
        with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipped_csv_file:
            csv_reader = csv.reader(StringIO(gzipped_csv_file.read().decode()))
            for line in csv_reader:
                process_line(line)
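
For large files, a streaming variant avoids holding the whole decoded file in one string. This is a sketch under the same assumptions (a boto3 bucket resource, a folder_prefix, and a process_line callback), using io.TextIOWrapper so csv.reader consumes decoded lines lazily:

import csv
import gzip
import io

for obj in bucket.objects.filter(Prefix=folder_prefix):
    if obj.key.endswith(".gz"):
        with gzip.GzipFile(fileobj=obj.get()["Body"]) as gz:
            # TextIOWrapper decodes incrementally instead of reading
            # the whole decompressed file up front
            for line in csv.reader(io.TextIOWrapper(gz, encoding="utf-8")):
                process_line(line)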

Upvotes: 2

Kirk

Reputation: 1845

This is old, but you no longer need a BytesIO object in the middle (at least with boto3==1.9.223 and Python 3.7):

import boto3
import gzip

s3 = boto3.resource("s3")
obj = s3.Object("YOUR_BUCKET_NAME", "path/to/your_key.gz")
with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
    content = gzipfile.read()
print(content)
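
Since content is bytes, a text log can then be decoded, assuming it is UTF-8 encoded:

# content holds the decompressed bytes; decode to str to work with text
for line in content.decode('utf-8').splitlines():
    print(line)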

Upvotes: 45

Anjala Abdurehman

Reputation: 75

Currently the file can be read directly with pandas (this relies on the s3fs package being installed so that pandas can resolve s3:// URLs):

import pandas as pd

bucket = 'bucket name'
data_key = 'data key'
data_location = 's3://{}/{}'.format(bucket, data_key)
data = pd.read_csv(data_location, compression='gzip', header=0, sep=',', quotechar='"')
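
With pandas 1.2+, fsspec options such as credentials can also be passed explicitly via storage_options. A sketch, assuming default boto3 credentials are configured:

data = pd.read_csv(
    data_location,
    compression='gzip',
    storage_options={'anon': False},  # let s3fs use the default credential chain
)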

Upvotes: 1

rahulb

Reputation: 1040

You can use AWS S3 Select (SelectObjectContent) to read gzip contents.

S3 Select is an Amazon S3 capability designed to pull out only the data you need from an object, which can dramatically improve the performance and reduce the cost of applications that need to access data in S3.

Amazon S3 Select works on objects stored in CSV, JSON, or Apache Parquet format, and on objects compressed with GZIP or BZIP2 (for CSV and JSON objects only).

Ref: https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html

from io import StringIO
import boto3
import pandas as pd

bucket = 'my-bucket'
prefix = 'my-prefix'

client = boto3.client('s3')

for s3_object in client.list_objects_v2(Bucket=bucket, Prefix=prefix)['Contents']:
    if s3_object['Size'] <= 0:
        continue

    print(s3_object['Key'])
    r = client.select_object_content(
            Bucket=bucket,
            Key=s3_object['Key'],
            ExpressionType='SQL',
            Expression="select * from s3object",
            InputSerialization={'CompressionType': 'GZIP', 'JSON': {'Type': 'DOCUMENT'}},
            OutputSerialization={'CSV': {'QuoteFields': 'ASNEEDED', 'RecordDelimiter': '\n', 'FieldDelimiter': ',', 'QuoteCharacter': '"', 'QuoteEscapeCharacter': '"'}},
        )

    for event in r['Payload']:
        if 'Records' in event:
            # each Records event carries a chunk of the query output
            payloads = event['Records']['Payload'].decode('utf-8')
            try:
                select_df = pd.read_csv(StringIO(payloads), error_bad_lines=False)
                for row in select_df.iterrows():
                    print(row)
            except Exception as e:
                print(e)

Upvotes: 10

amardip kumar

Reputation: 27

Read a .bz2 file from AWS S3 in Python:

import boto3
import bz2

try:
    s3 = boto3.resource('s3')
    key = 'key_name.bz2'
    obj = s3.Object('bucket_name', key)
    nn = obj.get()['Body'].read()
    # decompress the raw bytes, then split the text into lines
    content = bz2.decompress(nn).decode('utf-8')
    lines = content.split('\n')
    print(len(lines))

except Exception as e:
    print(e)
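
Alternatively, bz2.BZ2File decompresses lazily while you iterate, which avoids holding the full decompressed text in memory. A sketch reusing nn from above:

import bz2
from io import BytesIO

# BZ2File decompresses incrementally as the file object is iterated
with bz2.BZ2File(BytesIO(nn)) as f:
    for raw_line in f:
        print(raw_line.decode('utf-8').rstrip('\n'))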

Upvotes: 1

Levi

Reputation: 221

@Amit, I was trying to do the same thing to test decoding a file, and got your code to run with some modifications. I just had to remove the function def and the return, and rename the gzip variable, since that name is already taken by the module.

import boto3
from io import BytesIO
import gzip

try:
    s3 = boto3.resource('s3')
    key = 'YOUR_FILE_NAME.gz'
    obj = s3.Object('YOUR_BUCKET_NAME', key)
    n = obj.get()['Body'].read()
    gzipfile = BytesIO(n)
    gzipfile = gzip.GzipFile(fileobj=gzipfile)
    content = gzipfile.read()
    print(content)
except Exception as e:
    print(e)
    raise e

Upvotes: 21

Rez Moss

Reputation: 4604

Just like what we do with variables, data can be kept as bytes in an in-memory buffer when we use the io module's BytesIO operations.

Here is a sample program to demonstrate this:

import io

stream_str = io.BytesIO(b"JournalDev Python: \x00\x01")
print(stream_str.getvalue())

The getvalue() method returns the entire contents of the buffer as bytes.

So, the @Jean-FrançoisFabre answer is correct, and you should wrap the raw bytes in BytesIO (under a name that doesn't shadow the gzip module):

bytestream = BytesIO(n)

For more information read the following doc:

https://docs.python.org/3/library/io.html

Upvotes: 0
