Reputation: 1905
I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to the S3 bucket as it is being created, rather than writing the whole file locally and then uploading it at the end.
Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:
import csv
import io
import boto
from boto.s3.key import Key
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'
fieldnames = ['first_name', 'last_name']
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
k.set_contents_from_stream(writer.writeheader())
I received this error: BotoClientError: s3 does not support chunked transfer
UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'
testDict = [
    {"fieldA": "8", "fieldB": None, "fieldC": "888888888888"},
    {"fieldA": "9", "fieldB": None, "fieldC": "99999999999"},
]
f = io.StringIO()
fieldnames = ['fieldA', 'fieldB', 'fieldC']
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
k.set_contents_from_string(f.getvalue())
for row in testDict:
    writer.writerow(row)
    k.set_contents_from_string(f.getvalue())
f.close()
This writes 3 lines to the file; however, I'm unable to release memory in order to write a big file. If I add:
f.seek(0)
f.truncate(0)
to the loop, then only the last line of the file is written. Is there any way to release resources without deleting lines from the file?
Upvotes: 81
Views: 84362
Reputation: 105
To upload a stream to S3, you can use a Boto3 Session:
import boto3
import requests

# Stream the download and hand the raw response to upload_fileobj()
r = requests.get(download_url, stream=True)
session = boto3.Session(aws_access_key_id=S3_ACCESS_KEY, aws_secret_access_key=S3_SECRET_KEY)
s3 = session.resource("s3")
bucket = s3.Bucket(UPLOAD_BUCKET_NAME)
bucket.upload_fileobj(r.raw, key)
This code fetches a URL that force-downloads a file, reads the response as a stream, and uploads it to S3.
Upvotes: 1
Reputation: 14883
There's a well-supported library for doing just this:
pip install s3fs
s3fs is really trivial to use:
import s3fs

s3 = s3fs.S3FileSystem(anon=False)

with s3.open('mybucket/new-file', 'wb') as f:
    f.write(2*2**20 * b'a')
    f.write(2*2**20 * b'a')
Incidentally, there's also something built into boto3 (backed by the AWS API) called MultiPartUpload. It isn't factored as a Python stream, which might be an advantage for some people. Instead, you can start an upload and send parts one at a time.
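For reference, a minimal sketch of that low-level multipart flow with the boto3 client; the bucket/key names are placeholders, generate_chunks() is a hypothetical source of data, and every part except the last must be at least 5 MB:
import boto3

s3 = boto3.client('s3')
bucket, key = 'my-bucket', 'streamed/output.csv'  # placeholder names

# Start the multipart upload and send each chunk as a numbered part
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for part_number, chunk in enumerate(generate_chunks(), start=1):  # generate_chunks() is hypothetical
    resp = s3.upload_part(
        Bucket=bucket, Key=key,
        PartNumber=part_number,
        UploadId=mpu['UploadId'],
        Body=chunk,  # each part except the last must be >= 5 MB
    )
    parts.append({'ETag': resp['ETag'], 'PartNumber': part_number})

# Tell S3 to stitch the parts together into the final object
s3.complete_multipart_upload(
    Bucket=bucket, Key=key,
    UploadId=mpu['UploadId'],
    MultipartUpload={'Parts': parts},
)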
Upvotes: 0
Reputation: 1688
Here is a complete example using boto3:
import boto3
import io

session = boto3.Session(
    aws_access_key_id="...",
    aws_secret_access_key="..."
)
s3 = session.resource("s3")

# Accumulate lines in an in-memory bytes buffer, then upload in one put()
buff = io.BytesIO()
buff.write("test1\n".encode())
buff.write("test2\n".encode())

s3.Object(bucket, keypath).put(Body=buff.getvalue())
Upvotes: -1
Reputation: 15861
According to the docs, it's possible:
s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
so we can use StringIO in the ordinary way.
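For example, a minimal sketch along those lines (the bucket and key names are placeholders): build the CSV in an in-memory buffer and hand its contents to put().
import csv
import io
import boto3

# Write the CSV into an in-memory text buffer instead of a local file
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(['first_name', 'last_name'])
writer.writerow(['John', 'Doe'])

# Upload the buffered contents in a single put(); encode to bytes for S3
s3 = boto3.resource('s3')
s3.Object('mybucket', 'hello.csv').put(Body=buf.getvalue().encode('utf-8'))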
Update: the smart_open lib from @inquiring minds' answer is a better solution.
Upvotes: 3
Reputation: 155
We were trying to upload file contents to S3 when they came through as an InMemoryUploadedFile object in a Django request. We ended up doing the following because we didn't want to save the file locally. Hope it helps:
import boto3

s3 = boto3.client('s3')  # assumed: the original answer doesn't show how s3 was created

@action(detail=False, methods=['post'])
def upload_document(self, request):
    document = request.data.get('image').file
    s3.upload_fileobj(document, BUCKET_NAME,
                      DESIRED_NAME_OF_FILE_IN_S3,
                      ExtraArgs={"ServerSideEncryption": "aws:kms"})
Upvotes: 2
Reputation: 2259
There's an interesting code solution mentioned in a GitHub smart_open issue (#82) that I've been meaning to try out. Copy-pasting here for posterity... it looks like boto3 is required:
import csv
import gzip
import io
import boto3

# Build the CSV in a text buffer (csv expects str, not bytes)
csv_data = io.StringIO()
writer = csv.writer(csv_data)
writer.writerows(my_data)

# Gzip the CSV into a second in-memory buffer
gz_stream = io.BytesIO()
with gzip.GzipFile(fileobj=gz_stream, mode="w") as gz:
    gz.write(csv_data.getvalue().encode('utf-8'))
gz_stream.seek(0)

# Upload the compressed stream without touching disk
s3 = boto3.client('s3')
s3.upload_fileobj(gz_stream, bucket_name, key)
This specific example streams to a compressed S3 key/file, but it seems like the general approach, using the boto3 S3 client's upload_fileobj() method in conjunction with a target stream rather than a file, should work.
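As a rough sketch of that general idea without the gzip step (the bucket and key names below are placeholders):
import csv
import io
import boto3

rows = [['a', '1'], ['b', '2']]  # placeholder data

# Write the CSV into a text buffer, then wrap the encoded bytes as a stream
text_buf = io.StringIO()
csv.writer(text_buf).writerows(rows)
byte_buf = io.BytesIO(text_buf.getvalue().encode('utf-8'))

# upload_fileobj() reads from any file-like object, so an in-memory
# stream works just as well as an open file on disk
boto3.client('s3').upload_fileobj(byte_buf, 'my-bucket', 'data/rows.csv')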
Upvotes: 0
Reputation: 6132
To write a string to an S3 object, use:
s3.Object('my_bucket', 'my_file.txt').put(Body='Hello there')
So convert the stream to a string (e.g. with getvalue()) and you're there.
Upvotes: -4
Reputation: 1905
I did find a solution to my question, which I will post here in case anyone else is interested. I decided to do this as parts in a multipart upload. You can't stream to S3 directly, but there is a package available that turns your streaming file into a multipart upload, which is what I used: Smart Open.
import smart_open
import io
import csv
testDict = [
    {"fieldA": "8", "fieldB": None, "fieldC": "888888888888"},
    {"fieldA": "9", "fieldB": None, "fieldC": "99999999999"},
]
fieldnames = ['fieldA', 'fieldB', 'fieldC']
f = io.StringIO()
with smart_open.smart_open('s3://dev-test/bar/foo.csv', 'wb') as fout:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    fout.write(f.getvalue())

    for row in testDict:
        f.seek(0)
        f.truncate(0)
        writer.writerow(row)
        fout.write(f.getvalue())
f.close()
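As a follow-up note (an assumption about smart_open versions released after this answer): newer releases expose smart_open.open() instead of smart_open.smart_open(), and since it returns a file-like object you can hand it straight to csv.DictWriter and skip the StringIO shuffle entirely. A rough sketch:
import csv
import smart_open

fieldnames = ['fieldA', 'fieldB', 'fieldC']
rows = [{"fieldA": "8", "fieldB": None, "fieldC": "888888888888"}]

# smart_open.open() in text mode returns a file-like object; csv writes
# go straight into it and smart_open handles the multipart upload to S3
with smart_open.open('s3://dev-test/bar/foo.csv', 'w') as fout:
    writer = csv.DictWriter(fout, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)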
Upvotes: 58