dorothy

Reputation: 1243

Assistance with Python multithreading

Currently I have a list of URLs to grab contents from, and I am doing it serially. I would like to change it to grabbing them in parallel. Below is pseudocode. Is the design sound? I understand that .start() starts the thread, but my database is not updated. Do I need to use q.get()? Thanks.

import threading
import Queue

q = Queue.Queue()

def do_database(url):
    """ grab url, then insert it into the database """
    webdata = grab_url(url)
    try:
        insert_data_into_database(webdata)
    except:
        pass  # ....
    else:
        pass  # <do I need to do anything with the queue after each db operation is done?>

def put_queue(q, url):
    q.put(do_database(url))

for myfiles in currentdir:
    url = myfiles + some_other_string
    t = threading.Thread(target=put_queue, args=(q, url))
    t.daemon = True
    t.start()

Upvotes: 2

Views: 161

Answers (3)

Tim Peters

Reputation: 70735

It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, sure looks like the only thing q.put(do_database(url)) does is put None into q.

The usual way these things work is that a description of the work to do is added to a queue, and a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)

Here's a pretty complete - but untested - sketch:

import threading
import Queue

NUM_THREADS = 5  # whatever

q = Queue.Queue()
END_OF_DATA = object()  # a unique object

class Worker(threading.Thread):
    def run(self):
        while True:
            url = q.get()
            if url is END_OF_DATA:
                break
            webdata = grab_url(url)
            try:
                # Does your database support concurrent updates
                # from multiple threads?  If not, need to put
                # this in a "with some_global_mutex:" block.
                insert_data_into_database(webdata)
            except:
                pass  # ... handle/log the failure ...

threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for myfiles in currentdir:
    url = myfiles + some_other_string
    q.put(url)

# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
    q.put(END_OF_DATA)

# Shut down cleanly.  `daemon` is way overused.
for t in threads:
    t.join()
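
If your database doesn't support concurrent updates from multiple threads, here's a minimal sketch of that mutex (the lock name is just an assumption):

import threading

db_lock = threading.Lock()  # one global lock shared by all Worker threads

# inside Worker.run(), wrap the insert:
with db_lock:
    insert_data_into_database(webdata)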

Upvotes: 2

ledzep2

Reputation: 841

For the database, you have to commit before your changes become effective. But committing after every insert is not optimal; committing after a batch of changes gives much better performance.
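
For illustration, a minimal sketch of batched commits, assuming sqlite3 and a hypothetical pages table (any DB-API driver works the same way):

import sqlite3

conn = sqlite3.connect("pages.db")  # hypothetical database file

def insert_batch(rows):
    # one executemany + one commit per batch, instead of a commit per row
    conn.executemany("INSERT INTO pages (url, body) VALUES (?, ?)", rows)
    conn.commit()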

As for parallelism, Python wasn't really built for it. For your use case, I'd suggest using Python with gevent as a painless solution.

Here is a much more efficient pseudo-implementation, FYI:

import gevent
from gevent.monkey import patch_all
patch_all()  # monkey-patch the stdlib so urllib etc. cooperate with gevent
from gevent.queue import Queue


def web_worker(q, url):
    webdata = grab_url(url)  # fetch the page
    q.put(webdata)           # hand the result to the db worker

def db_worker(q):
    buf = []
    while True:
        buf.append(q.get())
        if len(buf) > 20:
            insert_stuff_in_buf_to_db(buf)  # bulk insert
            db_commit()                     # one commit per batch
            buf = []

def run(urls):
    q = Queue()
    gevent.spawn(db_worker, q)
    workers = [gevent.spawn(web_worker, q, url) for url in urls]
    gevent.joinall(workers)  # wait for all fetches to finish


run(urls)

Plus, since this implementation is entirely single-threaded, you can safely manipulate data shared between workers, such as the queue, the db connection, and global variables.

Upvotes: 1

John Zwinck

Reputation: 249642

You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.

Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).
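
As a rough sketch of that model, assuming the grab/insert helpers and URL list from the question (getPage is Twisted's simple HTTP client; the error handling here is just a placeholder):

from twisted.internet import reactor, defer
from twisted.web.client import getPage

def handle_page(webdata):
    # runs in the single reactor thread, so no locking is needed
    insert_data_into_database(webdata)

def handle_error(failure):
    # log and swallow failures so one bad URL doesn't stop the rest
    print failure

def fetch_all(urls):
    deferreds = []
    for url in urls:
        d = getPage(url)  # issue the request; returns a Deferred
        d.addCallbacks(handle_page, handle_error)
        deferreds.append(d)
    # stop the reactor once every request has either succeeded or failed
    defer.DeferredList(deferreds).addCallback(lambda _: reactor.stop())

fetch_all(urls)
reactor.run()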

Upvotes: 2
