Reputation: 1455
I use Python. I have 100 zip files, and each zip file contains more than 100 XML files. From the XML files I create CSV files.
import csv
import zipfile
from multiprocessing import Process
from xml.etree.ElementTree import fromstring

def parse_xml_for_csv1(data, writer1):
    root = fromstring(data)
    for node in root.iter('name'):
        # writerow expects a sequence, so wrap the single value in a list
        writer1.writerow([node.get('value')])

def create_csv1():
    with open('output1.csv', 'w') as f1:
        writer1 = csv.writer(f1)
        for i in range(1, 101):  # xml1.zip .. xml100.zip
            z = zipfile.ZipFile('xml' + str(i) + '.zip')
            # z.namelist() contains more than 100 xml files
            for finfo in z.namelist():
                data = z.read(finfo)
                parse_xml_for_csv1(data, writer1)

def create_csv2():
    with open('output2.csv', 'w') as f2:
        writer2 = csv.writer(f2)
        for i in range(1, 101):
            ...

if __name__ == "__main__":
    p1 = Process(target=create_csv1)
    p2 = Process(target=create_csv2)
    p1.start()
    p2.start()
    p1.join()
    p2.join()
Please tell me, how can I optimize my code to make it faster?
Upvotes: 0
Views: 897
Reputation: 140168
You just need to define one method that takes parameters, and split the processing of your 100 zip files across a given number of processes. The more processes you add, the more CPU you use, so you may be able to use more than 2 processes and go faster (although disk I/O can become a bottleneck at some point).
In the following code I can change to 4 or 10 processes without copying/pasting any code, and each process handles a different range of zip files.
Your code processes the same 100 files twice in parallel, so it was even slower than with no multiprocessing at all!
def create_csv(start_index, step):
    with open('output{0}.csv'.format(start_index // step), 'w') as f1:
        writer1 = csv.writer(f1)
        for i in range(start_index, start_index + step):
            z = zipfile.ZipFile('xml' + str(i) + '.zip')
            # z.namelist() contains more than 100 xml files
            for finfo in z.namelist():
                data = z.read(finfo)
                parse_xml_for_csv1(data, writer1)
if __name__ == "__main__":
    nb_files = 100
    nb_processes = 2  # raise to 4 or 8 depending on your machine
    step = nb_files // nb_processes
    lp = []
    for start_index in range(1, nb_files, step):
        p = Process(target=create_csv, args=[start_index, step])
        p.start()
        lp.append(p)
    for p in lp:
        p.join()
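As a variant, the same split can also be written with multiprocessing.Pool from the standard library, which manages the worker processes and the join for you. This is a minimal self-contained sketch under the same assumptions as above (files named xml1.zip .. xml100.zip, one output CSV per chunk); the process_chunk helper name is illustrative, and the chunking arithmetic mirrors the Process version:
import csv
import zipfile
from multiprocessing import Pool
from xml.etree.ElementTree import fromstring

def process_chunk(args):
    start_index, step = args
    # Each worker writes its own CSV, so no locking is needed between processes.
    with open('output{0}.csv'.format(start_index // step), 'w') as f:
        writer = csv.writer(f)
        for i in range(start_index, start_index + step):
            with zipfile.ZipFile('xml{0}.zip'.format(i)) as z:
                for finfo in z.namelist():
                    root = fromstring(z.read(finfo))
                    for node in root.iter('name'):
                        writer.writerow([node.get('value')])

if __name__ == "__main__":
    nb_files = 100
    nb_processes = 4
    step = nb_files // nb_processes
    # chunks here: [(1, 25), (26, 25), (51, 25), (76, 25)]
    chunks = [(s, step) for s in range(1, nb_files, step)]
    with Pool(nb_processes) as pool:
        pool.map(process_chunk, chunks)
Pool.map blocks until all chunks are done, which replaces the explicit start()/join() bookkeeping.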
Upvotes: 3