zest16

Reputation: 645

multiprocess library barely works

I'm using the multiprocess library to accelerate a CPU-bound task (a method inside a user-defined class).

The function processes one page of a document; in my example, a 500-page document takes around 20 seconds sequentially (about 0.04 seconds per page). To simulate this task, I increment a counter up to 2,000,000 per page.

dummy.py

from multiprocess import Pool

class DummyClass:

    def __init__(self, workers=1):
        self.workers = workers

    # Simulate CPU-intensive task
    def _process_one(self, page):
        count = 0
        while count < 2_000_000:
            count += 1
        return page

    # Process with "multiprocess"
    def multiprocess(self, pages):
        with Pool(processes=self.workers) as pool:
            # map_async(...).get() blocks until all pages are done,
            # so this behaves like a plain pool.map(...)
            async_results = pool.map_async(self._process_one, pages)
            extraction = async_results.get()
            return extraction
        
    # Process sequentially
    def sequential(self, pages):
        extraction = []
        for page in pages:
            extraction.append(self._process_one(page))
        return extraction

test.py

import time
from dummy import DummyClass

# Sequential with dummy method

def dummy_sequential():
    dummy_extractor = DummyClass()
    extraction = dummy_extractor.sequential(range(500))
    return extraction


# Multiprocessing with dummy method

def dummy_multiprocess(workers):
    dummy_extractor = DummyClass(workers=workers)
    extraction = dummy_extractor.multiprocess(range(500))
    return extraction

Testing sequential:

if __name__ == "__main__":

    ini = time.time()
    extraction = dummy_sequential()
    fin = time.time()
    print("Time: ", fin - ini, "seconds")

Prints out:

Time:  19.12088394165039 seconds

Testing multiprocess with different numbers of workers:

if __name__ == "__main__":

    for i in range(2, 9):
        ini = time.time()
        extraction = dummy_multiprocess(workers=i)
        fin = time.time()
        print(f"Time with {i} workers", fin - ini, "seconds")

Prints out:

Time with 2 workers 13.7001051902771 seconds
Time with 3 workers 11.189585208892822 seconds
Time with 4 workers 11.595974683761597 seconds
Time with 5 workers 12.016109228134155 seconds
Time with 6 workers 12.690005540847778 seconds
Time with 7 workers 13.012137651443481 seconds
Time with 8 workers 13.412734508514404 seconds

So 3 workers appears to be the optimum here; with more workers the time slowly climbs again.
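
To see how much of this is fixed startup cost (on Windows each worker process is spawned fresh and re-imports the module), here is a small diagnostic sketch that times pool creation separately from the map() call. The split is only approximate, since workers may still be initializing when map() starts, and perf_counter is just a higher-resolution timer:

import time
from multiprocess import Pool
from dummy import DummyClass

if __name__ == "__main__":
    extractor = DummyClass(workers=3)

    t0 = time.perf_counter()
    pool = Pool(processes=3)  # worker processes are spawned here
    t1 = time.perf_counter()

    extraction = pool.map(extractor._process_one, range(500))
    t2 = time.perf_counter()

    pool.close()
    pool.join()
    print(f"startup: {t1 - t0:.2f}s, map: {t2 - t1:.2f}s")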

However, this process needs to be as fast as possible. If a 500-page document takes 20 seconds sequentially, I would like to get under 2 seconds; my computer has 16 CPU cores, so ideal scaling would give roughly 20 s / 16 ≈ 1.25 s. The fastest I can get now is around 11 seconds.

I understand multiprocessing has some overhead, but this seems like too much. Is there any way to make it faster?
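
One knob I have not tuned is chunksize. Pool.map already batches the input into chunks by default, so an explicit value may not change much, but for reference here is a minimal variant using a hypothetical helper (not part of my class above; chunksize=50 is an untuned guess):

import time
from multiprocess import Pool
from dummy import DummyClass

def multiprocess_chunked(extractor, pages, chunksize=50):
    # Same work as DummyClass.multiprocess, but with an explicit
    # chunksize so each worker receives pages in larger batches
    # and fewer parent<->worker round-trips are needed.
    with Pool(processes=extractor.workers) as pool:
        return pool.map(extractor._process_one, pages, chunksize=chunksize)

if __name__ == "__main__":
    extractor = DummyClass(workers=3)
    ini = time.perf_counter()
    extraction = multiprocess_chunked(extractor, range(500))
    print("Time: ", time.perf_counter() - ini, "seconds")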

Thank you


New times:

I now get the following times:

Using "multiprocess"

sequential: 18.77s
2 workers: 10.71s
3 workers: 7.68s
4 workers: 8.11s
5 workers: 7.36s
6 workers: 6.99s
7 workers: 7.18s
8 workers: 7.20s

Using "multiprocessing"

sequential: 18.76s
2 workers: 10.34s
3 workers: 7.27s
4 workers: 6.55s
5 workers: 6.70s
6 workers: 6.80s
7 workers: 6.26s
8 workers: 6.30s

So the times now decrease further, but they are still far from the times my colleagues get.

I'm using multiprocess 0.70.17 on Python 3.12.9, Windows 11, 16 cores.
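
For completeness, the same run can also go through the standard library's concurrent.futures; a sketch under the same setup (I would expect numbers close to the multiprocessing ones above):

import time
from concurrent.futures import ProcessPoolExecutor
from dummy import DummyClass

if __name__ == "__main__":
    extractor = DummyClass()
    ini = time.perf_counter()
    # ProcessPoolExecutor.map also takes a chunksize; its default is 1,
    # so an explicit value avoids one round-trip per page.
    with ProcessPoolExecutor(max_workers=8) as ex:
        extraction = list(ex.map(extractor._process_one, range(500), chunksize=50))
    print("Time: ", time.perf_counter() - ini, "seconds")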

Upvotes: 0

Views: 92

Answers (0)
