zest16

Reputation: 645

multiprocess library barely works

I'm using the multiprocess library to accelerate a CPU-bound task (a method inside a user-defined class).

The function processes one page of a document; in my example, a 500-page document takes around 20 seconds sequentially (about 0.04 seconds per page). To simulate this task, I increment a counter up to 2,000,000 per page.

dummy.py

from multiprocess import Pool

class DummyClass:

    def __init__(self, workers=1):
        self.workers = workers

    # Simulate CPU-intensive task
    def _process_one(self, page):
        count = 0
        while count < 2_000_000:
            count += 1
        return page

    # Process with "multiprocess"
    def multiprocess(self, pages):
        with Pool(processes=self.workers) as pool:
            # map_async(...).get() blocks until all pages are done,
            # so this behaves like a plain pool.map(...)
            async_results = pool.map_async(self._process_one, pages)
            extraction = async_results.get()
            return extraction
        
    # Process sequentially
    def sequential(self, pages):
        extraction = []
        for page in pages:
            extraction.append(self._process_one(page))
        return extraction

test.py

import time
from dummy import DummyClass

# Sequential with dummy method

def dummy_sequential():
    dummy_extractor = DummyClass()
    extraction = dummy_extractor.sequential(range(500))
    return extraction


# Multiprocessing with dummy method

def dummy_multiprocess(workers):
    dummy_extractor = DummyClass(workers=workers)
    extraction = dummy_extractor.multiprocess(range(500))
    return extraction

Testing sequential:

if __name__ == "__main__":

    ini = time.time()
    extraction = dummy_sequential()
    fin = time.time()
    print("Time: ", fin - ini, "seconds")

Prints out:

Time:  19.12088394165039 seconds

Testing multiprocess with different numbers of workers:

if __name__ == "__main__":

    for i in range(2, 9):
        ini = time.time()
        extraction = dummy_multiprocess(workers=i)
        fin = time.time()
        print(f"Time with {i} workers", fin - ini, "seconds")

Prints out:

Time with 2 workers 13.7001051902771 seconds
Time with 3 workers 11.189585208892822 seconds
Time with 4 workers 11.595974683761597 seconds
Time with 5 workers 12.016109228134155 seconds
Time with 6 workers 12.690005540847778 seconds
Time with 7 workers 13.012137651443481 seconds
Time with 8 workers 13.412734508514404 seconds

So 3 workers appears to be the optimum here; with more workers the time slowly climbs again.
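
To see how much of this is fixed startup cost (on Windows each worker process is spawned fresh and re-imports the module), here is a small diagnostic sketch that times pool creation separately from the map() call. The split is only approximate, since workers may still be initializing when map() starts, and perf_counter is just a higher-resolution timer:

import time
from multiprocess import Pool
from dummy import DummyClass

if __name__ == "__main__":
    extractor = DummyClass(workers=3)

    t0 = time.perf_counter()
    pool = Pool(processes=3)  # worker processes are spawned here
    t1 = time.perf_counter()

    extraction = pool.map(extractor._process_one, range(500))
    t2 = time.perf_counter()

    pool.close()
    pool.join()
    print(f"startup: {t1 - t0:.2f}s, map: {t2 - t1:.2f}s")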

However, this process needs to be as fast as possible. If a 500-page document takes 20 seconds sequentially, I would like to get under 2 seconds; my computer has 16 CPU cores, so ideal scaling would give roughly 20 s / 16 ≈ 1.25 s. The fastest I can get now is around 11 seconds.

I understand multiprocessing has some overhead, but this seems like too much. Is there any way to make it faster?
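
One knob I have not tuned is chunksize. Pool.map already batches the input into chunks by default, so an explicit value may not change much, but for reference here is a minimal variant using a hypothetical helper (not part of my class above; chunksize=50 is an untuned guess):

import time
from multiprocess import Pool
from dummy import DummyClass

def multiprocess_chunked(extractor, pages, chunksize=50):
    # Same work as DummyClass.multiprocess, but with an explicit
    # chunksize so each worker receives pages in larger batches
    # and fewer parent<->worker round-trips are needed.
    with Pool(processes=extractor.workers) as pool:
        return pool.map(extractor._process_one, pages, chunksize=chunksize)

if __name__ == "__main__":
    extractor = DummyClass(workers=3)
    ini = time.perf_counter()
    extraction = multiprocess_chunked(extractor, range(500))
    print("Time: ", time.perf_counter() - ini, "seconds")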

Thank you


New times:

I now get the following times:

Using "multiprocess"

sequential: 18.77s
2 workers: 10.71s
3 workers: 7.68s
4 workers: 8.11s
5 workers: 7.36s
6 workers: 6.99s
7 workers: 7.18s
8 workers: 7.20s

Using "multiprocessing"

sequential: 18.76s
2 workers: 10.34s
3 workers: 7.27s
4 workers: 6.55s
5 workers: 6.70s
6 workers: 6.80s
7 workers: 6.26s
8 workers: 6.30s

So the times now decrease further, but they are still far from the times my colleagues get.

I'm using multiprocess 0.70.17 on Python 3.12.9, Windows 11, 16 cores.
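
For completeness, the same run can also go through the standard library's concurrent.futures; a sketch under the same setup (I would expect numbers close to the multiprocessing ones above):

import time
from concurrent.futures import ProcessPoolExecutor
from dummy import DummyClass

if __name__ == "__main__":
    extractor = DummyClass()
    ini = time.perf_counter()
    # ProcessPoolExecutor.map also takes a chunksize; its default is 1,
    # so an explicit value avoids one round-trip per page.
    with ProcessPoolExecutor(max_workers=8) as ex:
        extraction = list(ex.map(extractor._process_one, range(500), chunksize=50))
    print("Time: ", time.perf_counter() - ini, "seconds")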

Upvotes: 0

Views: 92

Answers (0)
