Reputation: 1243
Currently, i have a list of url to grab contents from and is doing it serially. I would like to change it to grabbing them in parallel. This is a psuedocode. I will like to ask is the design sound? I understand that .start() starts the thread, however, my database is not updated. Do i need to use q.get() ? thanks
import threading
import Queue
q = Queue.Queue()
def do_database(url):
""" grab url then input to database """
webdata = grab_url(url)
try:
insert_data_into_database(webdata)
except:
....
else:
< do I need to do anything with the queue after each db operation is done?>
def put_queue(q, url ):
q.put( do_database(url) )
for myfiles in currentdir:
url = myfiles + some_other_string
t=threading.Thread(target=put_queue,args=(q,url))
t.daemon=True
t.start()
Upvotes: 2
Views: 161
Reputation: 70735
It's odd that you're putting stuff into q
but never taking anything out of q
. What is the purpose of q
? In addition, since do_database()
doesn't return
anything, sure looks like the only thing q.put(do_database(url))
does is put None
into q
.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import Queue
NUM_THREADS = 5 # whatever
q = Queue.Queue()
END_OF_DATA = object() # a unique object
class Worker(threading.Thread):
def run(self):
while True:
url = q.get()
if url is END_OF_DATA:
break
webdata = grab_url(url)
try:
# Does your database support concurrent updates
# from multiple threads? If not, need to put
# this in a "with some_global_mutex:" block.
insert_data_into_database(webdata)
except:
#....
threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
t.start()
for myfiles in currentdir:
url = myfiles + some_other_string
q.put(url)
# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
q.put(END_OF_DATA)
# Shut down cleanly. `daemon` is way overused.
for t in threads:
t.join()
Upvotes: 2
Reputation: 841
For DB, You have to commit before your changes become effective. But, commit for every insert is not optimal. Commit after bulk changes gives much better performance.
For parallel, Python isn't born for this. For your use-case, i suppose using python with gevent would be a painless solution.
Here is a much more efficient pseudo implementation FYI:
import gevent
from gevent.monkey import patch_all
patch_all() # to use with urllib, etc
from gevent.queue import Queue
def web_worker(q, url):
grab_something
q.push(result)
def db_worker(q):
buf = []
while True:
buf.append(q.get())
if len(buf) > 20:
insert_stuff_in_buf_to_db
db_commit
buf = []
def run(urls):
q = Queue()
gevent.spawn(db_worker, q)
for url in urls:
gevent.spawn(web_worker, q, url)
run(urls)
plus, since this implementation is totally single threaded, you can safely manipulate shared data between workers like queue, db connection, global variables etc.
Upvotes: 1
Reputation: 249642
You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).
Upvotes: 2