Reputation: 443
I am quite new to Python and I need to implement multithreading in my code.
I have a huge .csv file (a million lines) as my input. For each line I read it, make a REST request, do some processing, and write the output to another file. The ordering of lines in the input/output files does matter. Right now I am doing this line by line. I want to run the same code in parallel, i.e. read 20 lines of input from the .csv file and make the REST calls in parallel so that my program is faster.
I have been reading up on http://docs.python.org/2/library/queue.html, but I have read about the Python GIL issue, which says the code will not run faster even with multithreading. Is there any other way to achieve multithreading in a simple way?
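For reference, my current sequential version looks roughly like this (process() stands in for my per-line logic; the file names are placeholders):

import urllib2

with open('input.csv', 'rb') as infile, open('output.txt', 'wb') as outfile:
    for line in infile:
        url = line.strip()  # each line holds the request target
        response = urllib2.urlopen(url).read()  # one REST call per line
        outfile.write(process(response))  # process() is my own per-line logic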
Upvotes: 0
Views: 339
Reputation: 414825
Python releases the GIL on I/O. If most of the time is spent doing REST requests, you could use threads to speed up processing:
try:
    from gevent.pool import Pool  # $ pip install gevent
    import gevent.monkey; gevent.monkey.patch_all()  # patch stdlib
except ImportError:  # fallback on using threads
    from multiprocessing.dummy import Pool
import urllib2

def process_line(url):
    try:
        return urllib2.urlopen(url).read(), None
    except EnvironmentError as e:
        return None, e

with open('input.csv', 'rb') as file, open('output.txt', 'wb') as outfile:
    pool = Pool(20)  # use 20 concurrent connections
    for result, error in pool.imap_unordered(process_line, file):
        if error is None:
            outfile.write(result)
If input/output order should be the same, you could use imap instead of imap_unordered.
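For example, keeping the same imports and process_line as above, the ordered variant looks like this:

with open('input.csv', 'rb') as file, open('output.txt', 'wb') as outfile:
    pool = Pool(20)
    # imap yields results in input order, buffering any request that
    # finishes early until its turn comes
    for result, error in pool.imap(process_line, file):
        if error is None:
            outfile.write(result)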
If your program is CPU-bound, you could use multiprocessing.Pool(), which creates multiple processes instead.
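A sketch of the process-based variant, reusing process_line from above (it must be defined at module level so it can be pickled, and the __main__ guard matters on platforms that spawn rather than fork):

from multiprocessing import Pool  # real processes, so CPU work scales past the GIL

if __name__ == '__main__':
    pool = Pool(4)  # e.g., one worker per CPU core
    with open('input.csv', 'rb') as file, open('output.txt', 'wb') as outfile:
        for result, error in pool.imap(process_line, file):
            if error is None:
                outfile.write(result)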
See also Python Interpreter blocks Multithreaded DNS requests?
This answer shows how to create a thread pool manually using threading + Queue modules.
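If you want to avoid the Pool abstraction entirely, a rough sketch of that manual approach (Python 2 module names, reusing process_line from above) could look like this:

import threading
from Queue import Queue  # the module is named 'queue' in Python 3

def worker(q, results):
    while True:
        i, line = q.get()
        results[i] = process_line(line)  # store by index so order survives
        q.task_done()

q = Queue()
results = {}
for _ in range(20):  # 20 worker threads
    t = threading.Thread(target=worker, args=(q, results))
    t.daemon = True  # don't keep the interpreter alive on exit
    t.start()

with open('input.csv', 'rb') as file:
    for i, line in enumerate(file):
        q.put((i, line))
q.join()  # block until every queued line is processed

with open('output.txt', 'wb') as outfile:
    for i in sorted(results):  # write back in original line order
        result, error = results[i]
        if error is None:
            outfile.write(result)

Note that this sketch keeps all results in memory, which may be an issue for a million-line file.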
Upvotes: 1
Reputation: 4806
Can you break the .csv file into multiple smaller files? If so, you could have another program run multiple instances of your processor.
Say the files are all named file1, file2, etc., and your processer.py script takes the filename as an argument. You could have:
import subprocess
import os
import signal

numfiles = 20  # example: how many chunk files you created
for i in range(1, numfiles + 1):  # file1 .. fileN
    program = subprocess.Popen(['python', 'processer.py', 'file' + str(i)])
    pid = program.pid
    # if you need to kill the process later:
    # os.kill(pid, signal.SIGINT)
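To produce file1, file2, etc. in the first place, you could split the big file with something like this (the chunk size is just an example value):

# split input.csv into numbered chunks: file1, file2, ...
lines_per_chunk = 50000  # example value; tune to taste
chunk, count = None, 0
with open('input.csv', 'rb') as infile:
    for i, line in enumerate(infile):
        if i % lines_per_chunk == 0:
            if chunk:
                chunk.close()
            count += 1
            chunk = open('file' + str(count), 'wb')
        chunk.write(line)
if chunk:
    chunk.close()

Since line order matters to you, have each processer.py instance write its own output file (output1, output2, ...) and concatenate them in order once all the subprocesses have finished.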
Upvotes: 2