Reputation: 15672
I have zip files uploaded to S3. I'd like to download them for processing. I don't need to permanently store them, but I need to temporarily process them. How would I go about doing this?
Upvotes: 13
Views: 36371
Reputation: 592
Adding on to @brice's answer.
Here is the code if you want to read the data inside the file line by line:
with zipfile.ZipFile(tf, mode='r') as zipf:
    for line in zipf.read("xyz.csv").split(b"\n"):
        print(line)
        break  # to break off after the first line
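Note that zipf.read() pulls the whole member into memory before splitting. If the file is large, a sketch like the one below (assuming the same tf buffer and the placeholder member name "xyz.csv") streams the lines instead via zipf.open() and io.TextIOWrapper:
import io
import zipfile

with zipfile.ZipFile(tf, mode='r') as zipf:
    with zipf.open("xyz.csv") as member:
        for line in io.TextIOWrapper(member, encoding="utf-8"):
            print(line.rstrip("\n"))
            break  # stop after the first line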
Hope this helps!
Upvotes: 0
Reputation: 2955
import boto3
import os
import zipfile
import io
import json
'''
When you configure awscli, you'll set up a credentials file located at
~/.aws/credentials. By default, this file will be used by Boto3 to authenticate.
'''
os.environ['AWS_PROFILE'] = "<profile_name>"
os.environ['AWS_DEFAULT_REGION'] = "<region_name>"
# Let's use Amazon S3
s3_name = "<bucket_name>"
zip_file_name = "<zip_file_name>"
file_to_open = "<file_to_open>"
s3 = boto3.resource('s3')
obj = s3.Object(s3_name, zip_file_name)
with io.BytesIO(obj.get()["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        file_contents = zipf.read(file_to_open).decode("utf-8")
        print(file_contents)
Reference: @brice's answer.
Upvotes: 1
Reputation: 450
Pandas provides a shortcut for this, which removes most of the code from the top answer and lets you be agnostic about whether your file path is on S3, GCP, or your local machine.
import io
import zipfile

import pandas as pd

obj = pd.io.parsers.get_filepath_or_buffer(file_path)[0]
with io.BytesIO(obj.read()) as byte_stream:
    # Use your byte stream, to, for example, print file names...
    with zipfile.ZipFile(byte_stream, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)
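If the goal is to load one of those members into a DataFrame rather than just list them, something along these lines should work (a sketch; it re-reads file_path, "xyz.csv" is a placeholder member name, and note that get_filepath_or_buffer is a private pandas helper that may be missing in newer pandas versions):
import io
import zipfile

import pandas as pd

obj = pd.io.parsers.get_filepath_or_buffer(file_path)[0]
with io.BytesIO(obj.read()) as byte_stream:
    with zipfile.ZipFile(byte_stream, mode='r') as zipf:
        # read a single member straight into a DataFrame
        df = pd.read_csv(zipf.open("xyz.csv"))
        print(df.head())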
Upvotes: 4
Reputation: 25029
Because working software > comprehensive documentation:
import zipfile
import boto
import io
from boto.s3.key import Key

# Connect to s3
# This will need your s3 credentials to be set up
# with `aws configure` using the aws CLI.
#
# See: https://aws.amazon.com/cli/
conn = boto.connect_s3()

# get hold of the bucket
bucket = conn.get_bucket("my_bucket_name")

# Get hold of a given file
key = Key(bucket)
key.key = "my_s3_object_key"

# Create an in-memory bytes IO buffer
with io.BytesIO() as b:
    # Read the file into it
    key.get_file(b)

    # Reset the file pointer to the beginning
    b.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(b, mode='r') as zipf:
        for subfile in zipf.namelist():
            do_stuff_with_subfile()
The same thing with boto3:
import zipfile
import boto3
import io

# this is just to demo. real use should use the config
# environment variables or config file.
#
# See: http://boto3.readthedocs.org/en/latest/guide/configuration.html
session = boto3.session.Session(
    aws_access_key_id="ACCESSKEY",
    aws_secret_access_key="SECRETKEY"
)

s3 = session.resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')
obj = bucket.Object('smsspamcollection.zip')

with io.BytesIO(obj.get()["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)
Tested on MacOSX with Python3.
Upvotes: 31
Reputation: 6186
I believe you have heard of boto, the Python interface to Amazon Web Services. You can fetch the key from S3 into a local file.
import os
import boto
from zipfile import ZipFile

s3 = boto.connect_s3()                    # connect
bucket = s3.get_bucket(bucket_name)       # get bucket
key = bucket.get_key(key_name)            # get key (the file in s3)
key.get_contents_to_filename(local_name)  # download to a temporary local file

with ZipFile(local_name, 'r') as myzip:
    # do something with myzip
    pass

os.unlink(local_name)  # delete the local copy
You can also use tempfile. For more detail, see create & read from tempfile.
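For example, a minimal sketch with NamedTemporaryFile (reusing the bucket_name / key_name placeholders above); the temporary file is removed automatically when the with block exits:
import tempfile
import boto
from zipfile import ZipFile

s3 = boto.connect_s3()
bucket = s3.get_bucket(bucket_name)
key = bucket.get_key(key_name)

with tempfile.NamedTemporaryFile(suffix=".zip") as tmp:
    key.get_contents_to_file(tmp)  # download into the temp file
    tmp.seek(0)
    with ZipFile(tmp, 'r') as myzip:
        print(myzip.namelist())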
Upvotes: 1
Reputation: 3078
If speed is a concern, a good approach is to choose an EC2 instance fairly close to your S3 bucket (i.e., in the same region) and use that instance to unzip and process your zipped files.
This reduces latency and lets you process the files fairly efficiently. You can remove each extracted file after finishing your work.
Note: this will only work if you are fine with using EC2 instances.
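For example, on the instance you could download the archive with boto3, extract it into a temporary working directory, and let everything be cleaned up when you are done (a sketch; the bucket and key names are placeholders):
import os
import tempfile
import zipfile
import boto3

s3 = boto3.client('s3')

with tempfile.TemporaryDirectory() as workdir:
    local_zip = os.path.join(workdir, 'archive.zip')
    s3.download_file('<bucket_name>', '<zip_file_name>', local_zip)
    with zipfile.ZipFile(local_zip, 'r') as zipf:
        zipf.extractall(workdir)
    # ... process the extracted files here ...
# everything under workdir is deleted when the block exits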
Upvotes: 3