Andrii Skaliuk

Reputation: 438

Python spawns AWS CLI process for S3 upload and it becomes very slow

My Python application spawns a subprocess that runs an AWS CLI S3 upload.

command = 'aws s3 sync /tmp/tmp_dir s3://mybucket/tmp_dir'
# spawn the process
sp = subprocess.Popen(
    shlex.split(str(command)),
    stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# wait for a while
sp.wait()
out, err = sp.communicate()

if sp.returncode == 0:
    logger.info("aws return code: %s", sp.returncode)
    logger.info("aws cli stdout `{}`".format(out))
    return

# handle error

/tmp/tmp_dir is ~0.5 GB and contains about 100 files. The upload takes ~25 minutes, which is extremely slow.

If I run the AWS command directly (without Python), it takes less than 1 minute.

What's wrong? Any help is appreciated.

Upvotes: 1

Views: 1460

Answers (1)

Alex G Rice

Reputation: 1579

I noticed a warning in the documentation about wait() usage (see below). However, instead of debugging this, why not rewrite it to use the Python SDK (boto3) instead of shelling out to the AWS CLI? You will probably get better performance and cleaner code.

https://boto3.readthedocs.io/en/latest/guide/s3.html
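As a rough sketch of the SDK route, something like the following could replace the CLI call. The helper name upload_dir and its parameters are made up for illustration, and it assumes boto3 is installed and credentials are configured. Note that it uploads every file unconditionally, so it is not a full equivalent of aws s3 sync, which skips unchanged files.

```python
import os


def upload_dir(client, local_dir, bucket, prefix):
    """Upload every file under local_dir to bucket under prefix; return the keys.

    client is expected to be a boto3 S3 client (anything with an
    upload_file(Filename, Bucket, Key) method works).
    """
    keys = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            # Build the object key from the path relative to local_dir,
            # using forward slashes regardless of the local OS separator.
            rel = os.path.relpath(path, local_dir).replace(os.sep, "/")
            key = prefix + "/" + rel
            client.upload_file(path, bucket, key)
            keys.append(key)
    return keys


# Usage, assuming boto3 is installed and AWS credentials are available:
#   import boto3
#   upload_dir(boto3.client("s3"), "/tmp/tmp_dir", "mybucket", "tmp_dir")
```

Passing the client in as an argument keeps the helper easy to test with a stand-in object before pointing it at a real bucket.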

Warning: This will deadlock when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe such that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that.

https://docs.python.org/2/library/subprocess.html
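To illustrate what the warning means in practice, here is a minimal sketch that keeps Popen with both pipes but drops the separate wait() call and relies on communicate() alone. The helper name run_command is made up for this example, and the child process is just a Python one-liner standing in for the aws command; it writes about 1 MB to stdout, which is more than a typical OS pipe buffer holds.

```python
import subprocess
import sys


def run_command(args):
    """Run args with captured output and return (returncode, out, err)."""
    sp = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # communicate() reads both pipes while the child runs and then waits for
    # it to exit, so the child never blocks on a full pipe buffer.
    out, err = sp.communicate()
    return sp.returncode, out, err


if __name__ == "__main__":
    # Stand-in child that produces a large amount of output on stdout.
    child = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1000000)"]
    code, out, err = run_command(child)
    print(code, len(out))
```

With wait() called before communicate(), the same child could stall once the pipe fills up, because nothing is reading from it yet.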

edit3:

Here is a solution that I just tested; it runs without blocking. There are convenience functions, such as check_output, that use wait() or communicate() under the hood and are easier to use:

#!/usr/bin/env python
import subprocess
from subprocess import CalledProcessError

command = ['aws','s3','sync','/tmp/test-sync','s3://bucket-name/test-sync']
try:
    result = subprocess.check_output(command)
    print(result)
except CalledProcessError as err:
    # handle error, check err.returncode which is nonzero.
    pass

Upvotes: 1
