I have written a Python multiprocessing program to process .csv files. The program opens each .csv file (around ~250 files) in a specific folder and appends a new row to each one.
Issue: the multiprocessing approach takes longer to process all of the CSV files than the plain sequential approach, even though multiprocessing is generally expected to be faster.
Result:
Multiprocess Time taken: 0:00:00.748690
Normal Time Taken: 0:00:00.253856
Do you see anything wrong with the code or the testing approach?
import multiprocessing
import csv
import datetime
import os

# Process CSVs - add a new row to each CSV - normal sequential way
def process_csv_normal(params):
    for param in params:
        csv_file = os.path.join(param.get('workspace'), param.get('file'))
        # newline='' prevents the csv module from inserting blank lines on Windows
        with open(csv_file, 'a', newline='') as csvfile:
            writer = csv.writer(csvfile)
            # Use a list, not a set: a set has no guaranteed column order
            writer.writerow(['AA001', 'AL', '[email protected]'])

# Main - normal sequential process
def main_normal():
    # Path of the CSV files - local machine folder path
    workspace = r"C:\Workdir\Python\csvfolder"
    params = [{'workspace': workspace, 'file': file_name}
              for file_name in os.listdir(workspace)
              if file_name.endswith('.csv')]
    process_csv_normal(params)

# Process CSV - add a new row to a single CSV (one call per file)
def process_csv_multiprocess(param):
    csv_file = os.path.join(param.get('workspace'), param.get('file'))
    with open(csv_file, 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['AA001', 'AL', '[email protected]'])

# Main - multiprocessing function
def main_multiprocess():
    # Path of the CSV files - local machine folder path
    workspace = r"C:\Workdir\Python\csvfolder"
    # Number of worker processes in the pool
    processes = 1
    params = [{'workspace': workspace, 'file': file_name}
              for file_name in os.listdir(workspace)
              if file_name.endswith('.csv')]
    pool = multiprocessing.Pool(processes=processes)
    pool.map_async(process_csv_multiprocess, params)
    pool.close()
    pool.join()

if __name__ == '__main__':
    start = datetime.datetime.now()
    main_multiprocess()
    print("Multiprocess Time taken: " + str(datetime.datetime.now() - start))

    start = datetime.datetime.now()
    main_normal()
    print("Normal Time Taken: " + str(datetime.datetime.now() - start))
Opening a file, appending a line and closing it is a very quick operation, bounded mostly by disk access (i.e., it is not CPU intensive). Disk access is in general not parallelizable (on an HDD the head has to move to a specific position, on an SSD you have to request a certain block, etc.), so multiprocessing won't help here.
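To get a feel for how cheap each per-file operation is, you can time a single open/append/close cycle in isolation; a minimal sketch (the file name and row values here are placeholders, not the ones from the question):

import csv
import datetime

start = datetime.datetime.now()
# One append: the elapsed time is dominated by disk access, not CPU work
with open('sample.csv', 'a', newline='') as csvfile:
    csv.writer(csvfile).writerow(['AA001', 'AL', 'user@example'])
print("Single append took:", datetime.datetime.now() - start)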
On top of that, you have to account for the overhead multiprocessing introduces: spawning a process pool, serializing, transferring and deserializing data, coordination, and so on. All of this has a cost, which is normally offset by the benefit of running multiple CPU-intensive tasks on multiple cores in parallel, and that is not your case.
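You can measure that fixed overhead on its own by pushing do-nothing tasks through a pool; a minimal sketch, assuming 250 tasks to mirror the ~250 files (the noop function is illustrative):

import datetime
import multiprocessing

def noop(x):
    # Does no work, so any measured time is pure multiprocessing overhead
    return x

if __name__ == '__main__':
    start = datetime.datetime.now()
    with multiprocessing.Pool(processes=1) as pool:
        pool.map(noop, range(250))
    print("Pool:", datetime.datetime.now() - start)

    start = datetime.datetime.now()
    for i in range(250):
        noop(i)
    print("Sequential:", datetime.datetime.now() - start)

Even though noop does nothing, the pool run still pays for process startup and for pickling the arguments and results, while the sequential loop finishes almost instantly.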