Python spawns AWS CLI process for S3 upload and it becomes very slow

Question

My Python application spawns a subprocess that runs the AWS CLI to upload a directory to S3.

command = 'aws s3 sync /tmp/tmp_dir s3://mybucket/tmp_dir'
# spawn the process
sp = subprocess.Popen(
    shlex.split(str(command)),
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# wait for a while
sp.wait()
out, err = sp.communicate()

if sp.returncode == 0:
    logger.info("aws return code: %s", sp.returncode)
    logger.info("aws cli stdout `{}`".format(out))
    return

# handle error

/tmp/tmp_dir is ~0.5 GB and contains about 100 files. The upload takes ~25 minutes, which is extremely slow.

If I run the same AWS command directly from the shell (without Python), it takes less than 1 minute.

What's wrong? Any help is appreciated.

Alex G Rice · Accepted Answer

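I noticed a warning in the documentation about wait() usage (see below). Your code calls sp.wait() before sp.communicate() with both stdout and stderr piped, which is the pattern that warning describes. However, instead of debugging this, why not rewrite it to use the Python SDK (boto3) instead of shelling out to the AWS CLI? You will probably get better performance and cleaner code.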

https://boto3.readthedocs.io/en/latest/guide/s3.html
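As a rough illustration of the SDK route, here is a minimal sketch that walks a local directory and uploads each file with the boto3 S3 client's upload_file method. The helper name upload_dir and the key layout are my own assumptions, not something from the question, and unlike `aws s3 sync` this uploads every file unconditionally rather than only the changed ones.

```python
import os


def upload_dir(client, local_dir, bucket, prefix):
    """Upload every file under local_dir to bucket under prefix.

    `client` is anything with an upload_file(Filename, Bucket, Key) method,
    such as boto3.client("s3"). Returns the list of keys uploaded.
    """
    uploaded = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            # Build an S3 key with forward slashes regardless of the local OS.
            rel = os.path.relpath(path, local_dir).replace(os.sep, "/")
            key = "/".join(part for part in (prefix.strip("/"), rel) if part)
            client.upload_file(path, bucket, key)
            uploaded.append(key)
    return uploaded


if __name__ == "__main__":
    import boto3  # third-party; credentials come from the usual AWS config

    s3 = boto3.client("s3")
    # Paths and bucket name taken from the question.
    upload_dir(s3, "/tmp/tmp_dir", "mybucket", "tmp_dir")
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object before pointing it at a real bucket.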

Warning: This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.

https://docs.python.org/2/library/subprocess.html
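If you want to keep Popen, the fix the warning points to is to drop the separate wait() call and let communicate() drain both pipes and wait for exit. Below is a minimal sketch; run_and_capture is a hypothetical helper name, and the demo uses a small Python child that prints a lot of output as a stand-in for the aws command.

```python
import subprocess
import sys


def run_and_capture(argv):
    """Run argv with both pipes captured; return (returncode, stdout, stderr)."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() reads stdout and stderr until EOF and then waits for the
    # child to exit, so the pipe buffers cannot fill up and stall the child.
    # No separate proc.wait() call is needed (or wanted) before it.
    out, err = proc.communicate()
    return proc.returncode, out, err


if __name__ == "__main__":
    # Stand-in child that writes more than a typical OS pipe buffer holds.
    noisy = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 200000)"]
    code, out, err = run_and_capture(noisy)
    print(code, len(out))
```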

Edit 3:

Here is a solution that I just tested; it runs without blocking. Convenience functions such as check_output use wait() or communicate() under the hood and are easier to use:

#!/usr/bin/env python
import subprocess
from subprocess import CalledProcessError

command = ['aws','s3','sync','/tmp/test-sync','s3://bucket-name/test-sync']
try:
    result = subprocess.check_output(command)
    print(result)
except CalledProcessError as err:
    # handle error, check err.returncode which is nonzero.
    pass
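One possible refinement, offered as a sketch rather than part of what I tested above: passing stderr=subprocess.STDOUT to check_output folds the child's error messages into the captured output, and CalledProcessError exposes that text as err.output. The helper name run_checked is hypothetical.

```python
import subprocess
import sys


def run_checked(argv):
    """Return (returncode, combined stdout+stderr) for the command argv."""
    try:
        # stderr=subprocess.STDOUT merges error messages into the captured output.
        out = subprocess.check_output(argv, stderr=subprocess.STDOUT)
        return 0, out
    except subprocess.CalledProcessError as err:
        # err.output holds whatever the child printed before exiting nonzero.
        return err.returncode, err.output


if __name__ == "__main__":
    # Stand-in for a failing aws command: prints to stderr and exits with 3.
    failing = [sys.executable, "-c",
               "import sys; sys.stderr.write('boom'); sys.exit(3)"]
    print(run_checked(failing))
```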
