fowtom

Reputation: 93

Boto3 put_object() is very slow

TL;DR: Trying to put .json files into an S3 bucket using Boto3; the process is very slow. Looking for ways to speed it up.

This is my first question on SO, so I apologize if I leave out any important details. Essentially, I am trying to pull data from Elasticsearch and store it in an S3 bucket using Boto3. I referred to this post to pull multiple pages of ES data using the scroll function of the ES Python client. As I scroll, I process the data and write each record to the bucket as a [timestamp].json file, using this:

    import boto3

    s3 = boto3.resource('s3')
    data = '{"some":"json","test":"data"}'
    key = "path/to/my/file/[timestamp].json"
    # one PUT request per ~240-byte JSON document
    s3.Bucket('my_bucket').put_object(Key=key, Body=data)

While running this on my machine, I noticed that the process is very slow. Using a line profiler, I found that this single line consumes over 96% of my entire program's runtime:

    s3.Bucket('my_bucket').put_object(Key=key, Body=data)

What modification(s) can I make to speed up this process? Keep in mind that I am creating the .json files in my program (each one is ~240 bytes) and streaming them directly to S3 rather than saving them locally and uploading them. Thanks in advance.

Upvotes: 7

Views: 10068

Answers (1)

dmulter

Reputation: 2758

Since you are potentially uploading many small files, you should consider a few things:

  • Some form of threading/multiprocessing, so that multiple PUT requests are in flight at once. For example, see How to upload small files to Amazon S3 efficiently in Python. A sketch follows this list.
  • Creating some form of archive file (ZIP) containing sets of your small data blocks and uploading it as one larger file. This is of course dependent on your access patterns. If you go this route, be sure to use the boto3 upload_file or upload_fileobj methods, as they handle multi-part upload and threading for you; a second sketch follows.
  • The S3 performance implications described in Request Rate and Performance Considerations.
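
Here is a minimal sketch of the threading option, assuming records is an iterable of (timestamp, json_string) pairs and the bucket/key layout from the question; the helper names and the max_workers value are illustrative, not from the question:

    import boto3
    from concurrent.futures import ThreadPoolExecutor, as_completed

    # boto3 clients are thread-safe, so one client can be shared by all workers
    s3 = boto3.client('s3')

    def put_record(timestamp, data):
        # one small PUT per record, same key layout as in the question
        key = "path/to/my/file/%s.json" % timestamp
        s3.put_object(Bucket='my_bucket', Key=key, Body=data)

    def upload_all(records, max_workers=16):
        # each PUT still pays the per-request latency, but the waits now overlap
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = [pool.submit(put_record, ts, data) for ts, data in records]
            for f in as_completed(futures):
                f.result()  # surface any upload error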
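
And a minimal sketch of the archive option under the same assumptions (the batch key name is made up): the ZIP is built in memory and pushed with upload_fileobj, which handles multi-part upload and threading internally:

    import io
    import zipfile
    import boto3

    s3 = boto3.client('s3')

    def upload_batch_as_zip(records, key="path/to/my/file/batch.zip"):
        # pack many ~240-byte JSON documents into one in-memory ZIP,
        # then ship it to S3 as a single larger object
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
            for timestamp, data in records:
                zf.writestr("%s.json" % timestamp, data)
        buf.seek(0)
        s3.upload_fileobj(buf, 'my_bucket', key)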

Upvotes: 4
