Ohm

Reputation: 2442

Using multiprocessing to process many files in parallel

I am trying to understand whether my way of using multiprocessing.Pool is efficient. The method I would like to run in parallel is a script that reads a certain file, does a calculation, and then saves the results to a different file. My code looks something like this:

from multiprocessing import Pool
import deepdish.io as dd

def savefile(a, b, t, c, g, e, d):
    print(a)
    dd.save(str(a), {'b': b, 't': t, 'c': c, 'g': g, 'e': e, 'd': d})


def run_many_calcs():
    num_processors = 6
    print("Num processors -", num_processors)
    pool = Pool(processes=num_processors)     # start 6 worker processes
    for a in ['a', 'b', 'c', 'd', 'e', 'f', 'g', 't', 'y', 'e', 'r', 'w']:
        pool.apply(savefile, args=(a, 4, 5, 6, 7, 8, 1))
    pool.close()
    pool.join()

How can I make sure that, immediately after one worker finishes a file, it continues to the next one?

Upvotes: 0

Views: 806

Answers (1)

bazza

Reputation: 8394

When considering the performance of any program, you have to work out whether it is bound by I/O (memory, disk, network, whatever) or by compute (core count, core speed, etc.).

If I/O is the bottleneck, there's no point having multiple processes, a faster CPU, etc.

If the computation is taking up all the time, then it is worth investing in multiple processes, etc. "Computation time" is often diagnosed as the problem, but on closer investigation it turns out to be limited by the computer's memory bus speed, not the clock rate of the cores. In such circumstances adding more processes can make things worse, because they all contend for the same memory bandwidth.

Check

You can check which case you're in by doing some performance profiling of your code; Python ships with cProfile in the standard library, and there are plenty of third-party profilers too.
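For instance, a minimal cProfile run looks like this (the `work` function is a hypothetical stand-in for your calculation):

```python
import cProfile
import io
import pstats

def work():
    # stand-in for the real per-file calculation
    total = 0
    for i in range(200_000):
        total += i * i
    return total

pr = cProfile.Profile()
pr.enable()
work()
pr.disable()

# print the five most expensive entries by cumulative time
stats_text = io.StringIO()
pstats.Stats(pr, stream=stats_text).sort_stats('cumulative').print_stats(5)
print(stats_text.getvalue())
```

In your case you'd wrap `savefile` instead; if the cumulative time is dominated by the file read/write calls rather than the calculation, you're I/O bound.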

My Guess

Most of the time these days it's I/O that's the bottleneck. If you don't want to profile your code, betting on a faster SSD is likely the best initial approach.

Unsolvable Computer Science Problem

The architectural features of modern CPUs (L1, L2, L3 cache, QPI, hyperthreads) are all symptoms of the underlying problem in computer design: cores are too quick for the I/O we can wrap around them.

For example, the time taken to transfer 1 byte from SDRAM to the core is exceedingly slow in comparison to the core speed. One just has to hope that the L3, L2 and L1 cache subsystems have correctly predicted the need for that 1 byte and have already fetched it ahead of time. If not, there's a big delay; that's where hyperthreading can help the overall performance of the computer's other processes (they can nip in and get some work done), but does absolutely nothing for the stalled program.

Data fetched from files or networks is very slow indeed.
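You can get a rough feel for the gap with a hypothetical micro-benchmark like the one below, which copies 20 MiB within RAM and then reads the same bytes back from a temporary file. Note the caveat: because the file was just written, the read will likely be served from the OS page cache; a cold read from an actual spinning disk would be far slower still.

```python
import os
import tempfile
import time

data = os.urandom(20 * 1024 * 1024)  # 20 MiB of random bytes

# pure in-memory copy (bytearray forces a real copy)
t0 = time.perf_counter()
ram_copy = bytes(bytearray(data))
ram_s = time.perf_counter() - t0

# write the data out, then time reading it back through the filesystem
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'wb') as f:
    f.write(data)

t0 = time.perf_counter()
with open(path, 'rb') as f:
    from_disk = f.read()
disk_s = time.perf_counter() - t0
os.remove(path)

print(f"RAM copy: {ram_s * 1000:.1f} ms, file read: {disk_s * 1000:.1f} ms")
```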

File System Caching

In your case it sounds like you have a single input file; that will at least get cached in RAM by the OS (provided it's not too big).

You may be tempted to read it into memory yourself; I wouldn't bother. If it's large you would be allocating a large amount of memory to hold it, and if that's too big for the RAM in the machine the OS will swap some of that RAM out to the virtual memory page file anyway, and you're worse off than before. If it's small enough there's a good chance the OS will cache the whole thing for you anyway, saving you the bother.

Written files are also cached, up to a point. Ultimately there's nothing you can do if "total process time" is taken to mean that all the data is written to disk; you'd be having to wait for the disk to complete writing no matter what you did in memory and what the OS cached.

The OS's filesystem cache might give an initial impression that file writing has completed (the OS will get on with consolidating the data on the actual drive shortly), but successive runs of the same program will get blocked once that write cache is full.
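You can see the write cache at work with a small sketch like the one below (my own illustration, not part of your code): the first write just lands in the OS cache, while the second calls os.fsync to block until the drive really has the data.

```python
import os
import tempfile
import time

data = os.urandom(20 * 1024 * 1024)  # 20 MiB of random bytes

def timed_write(path, sync):
    """Write `data` to `path`; optionally force it past the OS write cache."""
    t0 = time.perf_counter()
    with open(path, 'wb') as f:
        f.write(data)
        if sync:
            f.flush()
            os.fsync(f.fileno())   # block until the data is on the drive
    return time.perf_counter() - t0

with tempfile.TemporaryDirectory() as d:
    cached = timed_write(os.path.join(d, 'cached.bin'), sync=False)
    synced = timed_write(os.path.join(d, 'synced.bin'), sync=True)

print(f"write into cache: {cached:.3f}s, write + fsync: {synced:.3f}s")
```

On most machines the fsync'd write takes noticeably longer; the gap is the illusion the write cache gives you.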

If you do profile your code, be sure to run it for a long time (or repeatedly), to make sure that the measurements made by the profiler reveal the true underlying performance of the computer. If the results show that most of the time is spent in file.Read() or file.Write(), then you're I/O bound and more worker processes won't help.
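A crude alternative to a full profiler is to time the read, compute, and write phases of each file separately. A sketch (the reversal is a hypothetical stand-in for your calculation):

```python
import os
import tempfile
import time

def process_file(in_path, out_path):
    """Time the read / compute / write phases of one file separately."""
    t0 = time.perf_counter()
    with open(in_path, 'rb') as f:
        data = f.read()
    t1 = time.perf_counter()
    result = data[::-1]                # stand-in for the real calculation
    t2 = time.perf_counter()
    with open(out_path, 'wb') as f:
        f.write(result)
    t3 = time.perf_counter()
    return t1 - t0, t2 - t1, t3 - t2   # read, compute, write durations

with tempfile.TemporaryDirectory() as d:
    src = os.path.join(d, 'in.bin')
    with open(src, 'wb') as f:
        f.write(os.urandom(5 * 1024 * 1024))  # 5 MiB dummy input
    r, c, w = process_file(src, os.path.join(d, 'out.bin'))

print(f"read {r:.3f}s  compute {c:.3f}s  write {w:.3f}s")
```

Whichever phase dominates over many files tells you where to spend your effort.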

Upvotes: 1
