user1802143

Reputation: 15672

Python S3 download zip file

I have zip files uploaded to S3. I'd like to download them for processing. I don't need to permanently store them, but I need to temporarily process them. How would I go about doing this?

Upvotes: 13

Views: 36371

Answers (6)

Mohseen Mulla

Reputation: 592

Adding on to @brice's answer.


Here is the code if you want to read the data inside a file in the archive line by line (tf is the in-memory buffer from that answer):

with zipfile.ZipFile(tf, mode='r') as zipf:
    for line in zipf.read("xyz.csv").split(b"\n"):
        print(line)
        break # to break off after the first line
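
If the file inside the archive is large, a streaming variant avoids pulling the whole member into memory at once. A minimal sketch, assuming the same tf buffer and the same xyz.csv member name:

import io
import zipfile

with zipfile.ZipFile(tf, mode='r') as zipf:
    # zipf.open decompresses the member lazily as it is read
    with zipf.open("xyz.csv") as member:
        for line in io.TextIOWrapper(member, encoding="utf-8"):
            print(line.rstrip("\n"))
            break  # stop after the first line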

Hope this helps!

Upvotes: 0

nirojshrestha019

Reputation: 2955

Reading a specific file from a zip file in an S3 bucket:

import boto3
import os
import zipfile
import io


'''
When you configure awscli, you'll set up a credentials file located at
~/.aws/credentials. By default, this file will be used by Boto3 to authenticate.
'''
os.environ['AWS_PROFILE'] = "<profile_name>"
os.environ['AWS_DEFAULT_REGION'] = "<region_name>"

# Let's use Amazon S3
s3_name = "<bucket_name>"
zip_file_name = "<zip_file_name>"
file_to_open = "<file_to_open>"
s3 = boto3.resource('s3')
obj = s3.Object(s3_name, zip_file_name)

with io.BytesIO(obj.get()["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        file_contents = zipf.read(file_to_open).decode("utf-8")
        print(file_contents)

Adapted from @brice's answer.

Upvotes: 1

Teddy Ward

Reputation: 450

Pandas provides a shortcut for this, which removes most of the code from the top answer and lets you stay agnostic about whether your file path is on S3, GCP, or your local machine.

import io
import zipfile

import pandas as pd

# get_filepath_or_buffer is an internal pandas helper; for remote paths
# (s3://..., gs://...) the first element it returns is an open file-like object
obj = pd.io.parsers.get_filepath_or_buffer(file_path)[0]
with io.BytesIO(obj.read()) as byte_stream:
    # Use your byte stream, to, for example, print file names...
    with zipfile.ZipFile(byte_stream, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)
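
If the member you are after is a CSV, you can also hand it straight to pandas without extracting it to disk. A minimal sketch along the same lines, assuming a member named data.csv inside the archive (the name is just for illustration):

import io
import zipfile

import pandas as pd

obj = pd.io.parsers.get_filepath_or_buffer(file_path)[0]
with io.BytesIO(obj.read()) as byte_stream:
    with zipfile.ZipFile(byte_stream, mode='r') as zipf:
        # zipf.open gives a file-like object that read_csv can consume directly
        with zipf.open("data.csv") as member:
            df = pd.read_csv(member)

print(df.head())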

Upvotes: 4

brice

Reputation: 25029

Because working software > comprehensive documentation:

Boto2

import zipfile
import boto
import io

# Connect to s3
# This will need your s3 credentials to be set up 
# with `aws configure` using the aws CLI.
#
# See: https://aws.amazon.com/cli/
conn = boto.connect_s3()

# get hold of the bucket
bucket = conn.get_bucket("my_bucket_name")

# Get hold of a given file
key = boto.s3.key.Key(bucket)
key.key = "my_s3_object_key"

# Create an in-memory bytes IO buffer
with io.BytesIO() as b:

    # Read the file into it
    key.get_file(b)

    # Reset the file pointer to the beginning
    b.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(b, mode='r') as zipf:
        for subfile in zipf.namelist():
            do_stuff_with_subfile()

Boto3

import zipfile
import boto3
import io

# this is just to demo. real use should use the config 
# environment variables or config file.
#
# See: http://boto3.readthedocs.org/en/latest/guide/configuration.html

session = boto3.session.Session(
    aws_access_key_id="ACCESSKEY", 
    aws_secret_access_key="SECRETKEY"
)

s3 = session.resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')
obj = bucket.Object('smsspamcollection.zip')

with io.BytesIO(obj.get()["Body"].read()) as tf:

    # rewind the file
    tf.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)

Tested on MacOSX with Python3.
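
If you would rather not buffer the whole GetObject body in a single read() call, boto3 also offers download_fileobj, which streams the object into any writable file-like object (and handles multipart downloads of large files). A minimal sketch using the same bucket and key as above:

import zipfile
import boto3
import io

s3 = boto3.resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')

with io.BytesIO() as tf:
    # stream the object into the buffer; swap in a real file object
    # here if the archive is too large to hold in memory
    bucket.download_fileobj('smsspamcollection.zip', tf)
    tf.seek(0)
    with zipfile.ZipFile(tf, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)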

Upvotes: 31

emesday

Reputation: 6186

I believe you have heard of boto, which is a Python interface to Amazon Web Services.

You can download a key from S3 to a local file.

import os

import boto
from zipfile import ZipFile

s3 = boto.connect_s3()  # connect
bucket = s3.get_bucket(bucket_name)  # get bucket
key = bucket.get_key(key_name)  # get key (the file in s3)
key.get_contents_to_filename(local_name)  # download to a temporary local file

with ZipFile(local_name, 'r') as myzip:
    # do something with myzip, e.g. list its members
    print(myzip.namelist())

os.unlink(local_name)  # delete the local copy

You can also use tempfile, as sketched below, so you don't have to name and delete the local copy yourself. For more detail, see create & read from tempfile
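
A minimal sketch of the tempfile route, reusing the bucket_name / key_name placeholders from above; the temporary file is removed automatically when the with block exits:

import zipfile
import tempfile

import boto

s3 = boto.connect_s3()
bucket = s3.get_bucket(bucket_name)
key = bucket.get_key(key_name)

with tempfile.NamedTemporaryFile(suffix=".zip") as tmp:
    key.get_file(tmp)  # boto2's Key.get_file writes into an open file object
    tmp.seek(0)
    with zipfile.ZipFile(tmp, 'r') as myzip:
        print(myzip.namelist())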

Upvotes: 1

DanGar

Reputation: 3078

If speed is a concern, a good approach would be to choose an EC2 instance fairly close to your S3 bucket (in the same region) and use that instance to unzip/process your zipped files.

This reduces latency and lets you process the files fairly efficiently. You can remove each extracted file after finishing your work.

Note: This will only work if you are fine using EC2 instances.
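
A minimal sketch of that flow on the instance, assuming credentials come from an instance role and using placeholder bucket, key, and path names; process() stands in for whatever work you need to do:

import os
import shutil
import zipfile

import boto3

s3 = boto3.client("s3")

local_zip = "/tmp/archive.zip"          # placeholder paths
extract_dir = "/tmp/archive_contents"

# pull the archive onto the instance's local disk
s3.download_file("<bucket_name>", "<zip_file_name>", local_zip)

# extract, process, then clean everything up
with zipfile.ZipFile(local_zip, "r") as zipf:
    zipf.extractall(extract_dir)

for name in os.listdir(extract_dir):
    process(os.path.join(extract_dir, name))  # placeholder processing step

shutil.rmtree(extract_dir)
os.unlink(local_zip)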

Upvotes: 3
