Reputation: 1687
Trying to get a better understanding of this: https://pymotw.com/2/multiprocessing/basics.html
I have 20+ "large" logs (each log is roughly 6-9 GB but compressed, i.e. log1...20.gz)
My Python script goes through each log in an assigned directory, works out the total for a particular column, writes the result to a file, and moves on to the next log file. I noticed that when I did this I was not using all the cores in the system, so to use more of the cores I did this:
script1.py < folder 1 (contains logs 1-5 , write to report1-5.txt)
script2.py < folder 2 (contains logs 6-10, write to report6-10.txt)
script3.py < folder 3 (contains logs 11-15, write to report11-15.txt)
script4.py < folder 4 (contains logs 16-20, write to report16-20.txt)
Ideally I would just have script1.py < folder 1, where folder 1 contains all 20 logs and everything is written to a single report.txt.
If I use "import multiprocessing", will I be able to have 1 script with many workers going through the different files, or will it be many workers trying to work on the same log.gz file? Or am I misinterpreting the information?
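For reference, each script currently does something along these lines (a simplified Python 3 sketch; the folder, report name, and column index are just placeholders, not my real code):

import glob
import gzip

# sequential version: one folder of logs, total one column per log, one report
with open('report1-5.txt', 'w') as report:                # placeholder report name
    for logfile in sorted(glob.glob('folder1/*.gz')):     # placeholder folder
        total = 0.0
        with gzip.open(logfile, 'rt') as fh:              # 'rt' = read the gzip as text
            for line in fh:
                total += float(line.split()[3])           # placeholder column index
        report.write('%s %s\n' % (logfile, total))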
Upvotes: 4
Views: 6091
Reputation: 1564
Yes, you are on the right track. I do a similar thing all the time and it runs much faster. You need to unzip the files first. Glob the files to pick up and pass them as a list of filenames to pool.map(fn, lst). I should add that I use an SSD; if you use a regular spinning HD there may sadly be no speed improvement at all, but an SSD is great for this. Don't use Queue, close, or join, they're all unnecessary, so just use map().
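A minimal sketch of that glob + pool.map approach (Python 3; the glob pattern, column index, and report name are placeholders, and it assumes the logs have already been decompressed as suggested):

import glob
from multiprocessing import Pool

def crunch_file(filename):
    # total one column of a single, already-decompressed log file
    total = 0.0
    with open(filename) as fh:
        for line in fh:
            total += float(line.split()[3])          # placeholder column index
    return filename, total

if __name__ == '__main__':
    files = glob.glob('/path/to/unzipped_logs/*')    # placeholder pattern
    with Pool() as pool:                             # one worker per CPU core by default
        results = pool.map(crunch_file, files)       # blocks until every file is processed
    with open('report.txt', 'w') as report:
        for filename, total in sorted(results):
            report.write('%s %s\n' % (filename, total))

Each worker gets whole files from map(), so no two workers ever touch the same log.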
Upvotes: 0
Reputation: 11573
If I understand your question correctly, you're looking for a good way to speed up the processing of your gzip-compressed log files.
The first question you need to answer is whether your current process is CPU-bound or I/O-bound. That means: when you currently run script.py < folder 1 and watch it, e.g. with top, does your process go up to 100% CPU usage? If yes, then your process is CPU-bound (i.e. the CPU is the bottleneck), and in that case parallelization in Python will help you. If it is not (and it most certainly is not, as the disk will be your bottleneck unless the gz files lie on different disks), then you don't need to bother, as you won't get more speed out of this.
To parallelize you basically have two options:
python: you need to use multiprocessing, as you suggested. But to enable that, you cannot just import multiprocessing; you have to explicitly say what each process needs to do, e.g.:
from multiprocessing import Pool, Queue
from queue import Empty

def crunch_file(queue):
    while not queue.empty():
        try:
            filename = queue.get_nowait()  # non-blocking, in case another worker took the last file
        except Empty:
            break
        # gunzip file, do processing, write to reportx-y.txt

if __name__ == '__main__':
    queue = Queue()
    # put all filenames into the queue with os.walk() and queue.put(filename)
    pool = Pool(None, crunch_file, (queue,))
    pool.close()  # signal that we won't submit any more tasks to the pool
    pool.join()   # wait until all processes are done
A few things to note:
Pool(None, ...): Python will figure out the number of cores you have and will start one process per CPU core.
Queue: helps you to never have idling processes; if one of the processes is done with its file, it will take the next one from the queue.

bash: since you seem to be unfamiliar with Python's multiprocessing, and the different processes don't need to talk to each other, it would be a lot easier to start e.g. 4 Python programs in parallel, e.g.:
script.py < folder 1 &
script.py < folder 2 &
script.py < folder 3 &
script.py < folder 4 &
Upvotes: 6