Reputation: 1687
Trying to get a better understanding of this: https://pymotw.com/2/multiprocessing/basics.html
I have 20+ "large" logs (each log is roughly 6-9 GB but compressed, i.e. log1...20.gz)
My Python script goes through each log in an assigned directory, works out the total for a particular column, writes the result to a file, and moves on to the next log file. I noticed that when I did this I was not using all the cores in the system, so to use more of the cores I did this:
script1.py < folder 1 (contains logs 1-5 , write to report1-5.txt)
script2.py < folder 2 (contains logs 6-10, write to report6-10.txt)
script3.py < folder 3 (contains logs 11-15, write to report11-15.txt)
script4.py < folder 4 (contains logs 16-20, write to report16-20.txt)
Ideally I would just have script1.py < folder 1, where folder 1 contains all 20 logs and everything is written to a single report.txt.
If I use "import multiprocessing", will I be able to have 1 script with many workers going through the different files, or will it be many workers trying to work on the same log.gz file? Or am I misinterpreting the information?
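For reference, each script currently does something along these lines (a simplified Python 3 sketch; the folder, report name, and column index are just placeholders, not my real code):

import glob
import gzip

# sequential version: one folder of logs, total one column per log, one report
with open('report1-5.txt', 'w') as report:                # placeholder report name
    for logfile in sorted(glob.glob('folder1/*.gz')):     # placeholder folder
        total = 0.0
        with gzip.open(logfile, 'rt') as fh:              # 'rt' = read the gzip as text
            for line in fh:
                total += float(line.split()[3])           # placeholder column index
        report.write('%s %s\n' % (logfile, total))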
Upvotes: 4
Views: 6091
Reputation: 1564
Yes, you are on the right track. I do a similar thing all the time and it runs much faster. You need to unzip the files first. Glob the files to pick up and pass them as a list of filenames to pool.map(fn, lst). I should add that I use an SSD; if you use a regular spinning HD there may sadly be no speed improvement at all, but an SSD is great for this. Don't use Queue, close, or join, they're all unnecessary, so just use map().
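A minimal sketch of that glob + pool.map approach (Python 3; the glob pattern, column index, and report name are placeholders, and it assumes the logs have already been decompressed as suggested):

import glob
from multiprocessing import Pool

def crunch_file(filename):
    # total one column of a single, already-decompressed log file
    total = 0.0
    with open(filename) as fh:
        for line in fh:
            total += float(line.split()[3])          # placeholder column index
    return filename, total

if __name__ == '__main__':
    files = glob.glob('/path/to/unzipped_logs/*')    # placeholder pattern
    with Pool() as pool:                             # one worker per CPU core by default
        results = pool.map(crunch_file, files)       # blocks until every file is processed
    with open('report.txt', 'w') as report:
        for filename, total in sorted(results):
            report.write('%s %s\n' % (filename, total))

Each worker gets whole files from map(), so no two workers ever touch the same log.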
Upvotes: 0
Reputation: 11573
If I understand your question correctly, you're looking for a good way to speed up the processing of your gzip-compressed log files.
The first question you need to answer is whether your current process is CPU-bound or I/O-bound. That means: when you currently run script.py < folder 1 and watch it, e.g. with top, does your process go up to 100% CPU usage? If yes, then your process is CPU-bound (i.e. the CPU is the bottleneck), and in that case parallelization in Python will help you. If it is not (and it most certainly is not, as the disk will be your bottleneck unless the gz files lie on different disks), then you don't need to bother, as you won't get more speed out of this.
To parallelize you basically have two options:
python: you need to use multiprocessing, as you suggested. But to enable that, you cannot just import multiprocessing; you have to explicitly say what each process needs to do, e.g.:
from multiprocessing import Pool, Queue
from queue import Empty

def crunch_file(queue):
    while not queue.empty():
        try:
            filename = queue.get_nowait()  # non-blocking, in case another worker took the last file
        except Empty:
            break
        # gunzip file, do processing, write to reportx-y.txt

if __name__ == '__main__':
    queue = Queue()
    # put all filenames into the queue with os.walk() and queue.put(filename)
    pool = Pool(None, crunch_file, (queue,))
    pool.close()  # signal that we won't submit any more tasks to the pool
    pool.join()   # wait until all processes are done
A few things to note:
Pool(None, ...): Python will figure out the number of cores you have and will start one process per CPU core.
Queue: helps you to never have idling processes; if one of the processes is done with its file, it will take the next one from the queue.

bash: since you seem to be unfamiliar with Python's multiprocessing, and the different processes don't need to talk to each other, it would be a lot easier to start e.g. 4 Python programs in parallel, e.g.:
script.py < folder 1 &
script.py < folder 2 &
script.py < folder 3 &
script.py < folder 4 &
Upvotes: 6