ashish14

Reputation: 670

Uploading multiple files in parallel to S3 using boto

http://ls.pwd.io/2013/06/parallel-s3-uploads-using-boto-and-threads-in-python/

I tried the second solution mentioned in the link to upload multiple files to S3. The code in that link doesn't call join on the threads, which means the main program can finish even though the threads are still running. With this approach the overall program runs much faster, but there is no guarantee that the files were uploaded correctly. Is that really true? I mostly care about the main program finishing fast. What side effects can this approach have?
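A minimal sketch of what I mean, with join() added so the main program waits for every upload before exiting (the bucket name and file list are made-up placeholders; this uses the legacy boto API from the linked post):

import threading

import boto
from boto.s3.key import Key


def upload(bucket_name, filename):
    # each thread opens its own connection; sharing one boto
    # connection across threads is not safe
    conn = boto.connect_s3()
    bucket = conn.get_bucket(bucket_name)
    key = Key(bucket)
    key.key = filename
    key.set_contents_from_filename(filename)


filenames = ['a.txt', 'b.txt', 'c.txt']  # placeholder file list
threads = [threading.Thread(target=upload, args=('my-bucket', f))
           for f in filenames]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every upload to finish before moving on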

Upvotes: 1

Views: 9152

Answers (1)

Sam Mason

Reputation: 16174

Just having a little play with this: I see multiprocessing takes a while to tear down a Pool, but otherwise there's not much in it.

The test code is:

from time import time, sleep
from multiprocessing.pool import Pool, ThreadPool
from threading import Thread


N_WORKER_JOBS = 10


def worker(x):
    # simulate 100ms of work per job
    # print("working on", x)
    sleep(0.1)


def mp_proc(fn, n):
    start = time()
    with Pool(N_WORKER_JOBS) as pool:
        t1 = time() - start  # pool creation time
        pool.map(fn, range(n))
        start = time()
    t2 = time() - start  # pool teardown time
    print(f'Pool creation took {t1*1000:.2f}ms, teardown {t2*1000:.2f}ms')


def mp_threads(fn, n):
    start = time()
    with ThreadPool(N_WORKER_JOBS) as pool:
        t1 = time() - start  # pool creation time
        pool.map(fn, range(n))
        start = time()
    t2 = time() - start  # pool teardown time
    print(f'ThreadPool creation took {t1*1000:.2f}ms, teardown {t2*1000:.2f}ms')


def threads(fn, n):
    # one plain Thread per job: start them all, then join them all
    threads = []
    for i in range(n):
        t = Thread(target=fn, args=(i,))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()


if __name__ == '__main__':  # guard keeps this safe on spawn-based platforms
    for test in [mp_proc, mp_threads, threads]:
        times = []
        for _ in range(7):
            start = time()
            test(worker, 10)
            times.append(time() - start)

        times = ', '.join(f'{t*1000:.2f}' for t in times)
        print(f'{test.__name__} took {times}ms')

I get the following timings (Python 3.7.3, Linux 5.0.8):

  • mp_proc ~220ms
  • mp_threads ~200ms
  • threads ~100ms

However, the teardown times are all ~100ms, which brings everything mostly into line.

I've poked around with logging and in the source, and it seems to be due to _handle_workers only checking every 100ms (it does its status checks, then sleeps for 0.1 seconds).
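The poll granularity is easy to see directly by timing just the teardown of an otherwise idle pool (a quick check of my own, not part of the measurements above):

from time import time
from multiprocessing.pool import ThreadPool

pool = ThreadPool(10)
pool.map(abs, range(10))  # trivial work to warm the pool up
start = time()
pool.close()
pool.join()  # blocks on the handler threads, which poll every 0.1s
print(f'teardown took {(time() - start)*1000:.2f}ms')  # typically ~100ms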

With this knowledge, I can change the code to sleep for 0.095 seconds instead, and then everything comes within 10% of everything else. Also, given that this cost is incurred just once, at pool teardown, it's easy to arrange for it not to happen in an inner loop; see the sketch below.
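For example, hoisting the pool out of the repeated path pays the teardown cost once, at the end, rather than per batch (a sketch along the lines of the test code above, not part of the original answer):

from time import sleep, time
from multiprocessing.pool import ThreadPool


def worker(x):
    sleep(0.1)  # simulate 100ms of work per job


# create the pool once and reuse it for every batch, so the ~100ms
# teardown is paid a single time at the very end
with ThreadPool(10) as pool:
    for _ in range(7):
        start = time()
        pool.map(worker, range(10))
        print(f'batch took {(time() - start)*1000:.2f}ms')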

Upvotes: 3
