abhinavkulkarni

Reputation: 2409

Batch processing on multiple cores

I want to do batch processing of files on multiple cores. I have the following scenario:

  1. I have 20 files.
  2. I have a function that takes a filename, processes it, and produces an integer result. I want to apply the function to all 20 files, compute the integer output for each of them, and finally sum the individual outputs and print the total.
  3. Since I have 4 cores, I can only process 4 files at a time. Thus I want to run 5 rounds, processing 4 files in each round (4*5 = 20).
  4. That is, I want to create 4 processes, each processing 5 files one after another (the 1st process handles files 0, 4, 8, 12, 16, the 2nd process handles files 1, 5, 9, 13, 17, and so on), as illustrated below.
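Purely to illustrate the assignment I have in mind (not the parallel implementation I'm asking about), with hypothetical filenames 0.txt through 19.txt:

files = ["%d.txt" % n for n in range(20)]

for worker in range(4):
    # worker 0 would handle files 0, 4, 8, 12, 16; worker 1 handles 1, 5, 9, 13, 17; etc.
    print(worker, files[worker::4])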

How do I achieve this? I am confused by multiprocessing.Pool(), multiprocessing.Process() and various other options.

Thanks.

Upvotes: 5

Views: 8999

Answers (3)

abarnert

Reputation: 365817

Although Jared's answer is great, I personally would use a ProcessPoolExecutor from the concurrent.futures module, and not even worry about multiprocessing:

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as executor:
    result = sum(executor.map(process_file, files))

When it gets a little more complicated, the future object, or futures.as_completed, can be really nifty compared to the multiprocessing equivalents. When it gets a lot more complicated, multiprocessing is a whole lot more flexible and powerful. But when it's this trivial, really, it's almost hard to tell the difference.
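For instance, a minimal sketch (reusing the same hypothetical process_file and files names) that sums the results as each one completes, rather than in submission order:

from concurrent.futures import ProcessPoolExecutor, as_completed

with ProcessPoolExecutor(max_workers=4) as executor:
    # submit one task per file and consume results in completion order
    futures = [executor.submit(process_file, f) for f in files]
    total = sum(future.result() for future in as_completed(futures))

print(total)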

Upvotes: 2

Blender

Reputation: 298256

It's pretty simple.

from multiprocessing import Pool

def process_file(filename):
    # placeholder: your real function would process the file and return an integer
    return filename

if __name__ == '__main__':
    pool = Pool()
    files = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    results = pool.imap(process_file, files)

    for result in results:
        print(result)

Pool automatically defaults to the number of processor cores that you have. Also, make sure that your processing function is importable from the file and that your multiprocessing code is inside the if __name__ == '__main__': block. If not, you'll create a fork bomb and lock up your computer.
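Adapted to your actual goal (one integer per file, summed at the end), a sketch might look like this; process_file here is just a stand-in for your real per-file computation, and the filenames are made up:

from multiprocessing import Pool

def process_file(filename):
    # stand-in: your real function would open the file and compute an integer
    return len(filename)

if __name__ == '__main__':
    files = ['%d.txt' % n for n in range(20)]

    with Pool(processes=4) as pool:   # 4 workers to match your 4 cores
        total = sum(pool.imap(process_file, files))

    print(total)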

Upvotes: 3

Jared

Reputation: 26397

In order to demonstrate Pool, I'm going to suppose your working function, which consumes a filename and produces a number, is named work, and that the 20 files are labeled 1.txt, ..., 20.txt. One way to set this up would be as follows:

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(processes=4)          # one worker per core
    result = pool.map_async(work, ("%d.txt" % n for n in range(1, 21)))
    print(sum(result.get()))          # get() blocks until all 20 results are in

This method will do the work of steps 3 and 4 for you.
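If the interleaving in step 4 matters to you, one possible tweak is to pass chunksize=1, so each worker pulls a single filename at a time as it becomes free rather than a pre-split chunk (note that the pool still decides which worker gets which file, so the mapping won't be exactly 0, 4, 8, ...):

result = pool.map_async(work, ("%d.txt" % n for n in range(1, 21)), chunksize=1)
print(sum(result.get()))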

Upvotes: 7
