Reputation: 115
I have multiple data files that I process using the Python pandas library. The files are processed one at a time, and Task Manager shows that only one logical processor is being used (it sits at ~95% while the rest stay under 5%).
Is there a way to process the data files simultaneously? If so, is there a way to utilize the other logical processors to do that?
(Edits are welcome)
Upvotes: 1
Views: 3679
Reputation: 1282
If your file names are in a list, you could use this code:
from multiprocessing import Process

def YourCode(filename, otherdata):
    # Do your stuff
    pass

if __name__ == '__main__':
    # Process the files in parallel
    ListOfFilenames = ['file1', 'file2', ..., 'file1000']
    Processors = 20  # number of processes to run at the same time
    otherdata = None  # whatever extra data YourCode needs

    # Divide the list of files into chunks of 'Processors' files
    Parts = [ListOfFilenames[i:i + Processors] for i in range(0, len(ListOfFilenames), Processors)]
    for part in Parts:
        ListOfProcesses = []
        for f in part:
            p = Process(target=YourCode, args=(f, otherdata))
            p.start()
            ListOfProcesses.append(p)
        # Wait for the current chunk to finish before starting the next one
        for p in ListOfProcesses:
            p.join()
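The chunking can also be delegated to a worker pool, which caps the number of concurrent processes for you. A minimal sketch using multiprocessing.Pool (assuming Python 3 and that each file can be processed independently; process_file is a hypothetical stand-in for your own per-file pandas logic):

from multiprocessing import Pool

def process_file(filename):
    # hypothetical placeholder: load the file with pandas and write results
    pass

if __name__ == '__main__':
    filenames = ['file1', 'file2', 'file3']
    # the pool keeps at most 4 worker processes busy at a time
    with Pool(processes=4) as pool:
        pool.map(process_file, filenames)

pool.map blocks until every file has been processed, so you get the same "run N at a time" behavior without managing the chunks yourself.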
Upvotes: 1
Reputation: 2843
You can process the different files in different threads or in different processes.
The good thing about Python is that its standard library provides tools for this:
from multiprocessing import Process

def process_panda(filename):
    # this function will be started in a different process
    # (placeholder helpers standing in for your own pandas code)
    process_panda_import()
    write_results()

if __name__ == '__main__':
    # start process 1
    p1 = Process(target=process_panda, args=('file1',))
    p1.start()
    # start process 2
    p2 = Process(target=process_panda, args=('file2',))
    p2.start()
    # wait until process 2 has finished
    p2.join()
    # wait until process 1 has finished
    p1.join()
The program will start 2 child processes, which can be used to process your files. Of course, you can do something similar with threads, as sketched below.
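A minimal sketch of the threading variant (a hypothetical adaptation, not from the code above) could look like this; keep in mind that CPU-bound pandas work gains little from threads because of Python's global interpreter lock, so threads mainly help when the per-file work is I/O-bound:

import threading

def process_panda(filename):
    # same per-file work as in the process-based example
    pass

if __name__ == '__main__':
    threads = []
    for f in ['file1', 'file2']:
        t = threading.Thread(target=process_panda, args=(f,))
        t.start()
        threads.append(t)
    # wait for all threads to finish
    for t in threads:
        t.join()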
You can find the documentation here: https://docs.python.org/2/library/multiprocessing.html
and here:
https://pymotw.com/2/threading/
Upvotes: 0