TTT

Reputation: 4434

Python threading performance

I would like to use Python's multithreading capability in my app, but I have run into what I suspect is a performance issue. The site is hosted on GAE and it talks to a REST server on EC2 to do some calculations. The REST server is powered by bottlepy.

My question is: on the GAE side, I have a loop which calls the REST server multiple times to do the calculations. To improve performance, I use the threading library. But I found that some of the calculations go missing. I usually do not have this issue when only twenty jobs are fired, but I do when 200 jobs are fired. I appreciate any suggestions.

Here is my code:

from threading import Thread
from google.appengine.api import urlfetch

def my_function():
    ...
    response = urlfetch.fetch(url=url, payload=data, method=urlfetch.POST, headers=http_headers, deadline=60)

# In this function, I use Thread to run the calls concurrently
def loop_fun():
    all_threads = []
    for i in range(100):
        p = Thread(target=my_function)
        all_threads.append(p)
    # Start all threads
    [x.start() for x in all_threads]
    # Wait for all of them to finish
    [x.join() for x in all_threads]

Below is the error message for one job (usually I receive several errors of this type):

Exception in thread Thread-12:
Traceback (most recent call last):
  File "C:\Program Files (x86)\Google\google_appengine\google\appengine\dist27\threading.py", line 569, in __bootstrap_inner
    self.run()
  File "C:\Program Files (x86)\Google\google_appengine\google\appengine\dist27\threading.py", line 522, in run
    self.__target(*self.__args, **self.__kwargs)
  File "D:\Dropbox\ubertool_src\genee\genee_model.py", line 102, in __init__
    response = urlfetch.fetch(url=url, payload=data, method=urlfetch.POST, headers=http_headers, deadline=60)
  File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\urlfetch.py", line 270, in fetch
    return rpc.get_result()
  File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\apiproxy_stub_map.py", line 612, in get_result
    return self.__get_result_hook(self)
  File "C:\Program Files (x86)\Google\google_appengine\google\appengine\api\urlfetch.py", line 403, in _get_fetch_result
    raise DownloadError("Unable to fetch URL: " + url + error_detail)
DownloadError: Unable to fetch URL: http://url_20140122160100678000 Error: [Errno 10061] No connection could be made because the target machine actively refused it

Upvotes: 0

Views: 117

Answers (1)

mojo

Reputation: 4132

If the problem is one of overload, it might benefit from a "pool of workers" strategy.

import threading
import Queue

def worker(jobs):
    while True:
        url = jobs.get()
        if url is None:
            break

        # do stuff with the URL


if __name__ == '__main__':
    thread_count = 30

    job_q = Queue.Queue()

    pool = [ threading.Thread(target=worker, args=(job_q,))
             for i in range(thread_count) ]
    for p in pool:
        p.start()

    # urls_to_get is whatever collection of jobs/URLs you need processed.
    for url in urls_to_get:
        job_q.put(url)

    # Signal each thread that there are no more jobs.
    for p in pool:
        job_q.put(None)

    for p in pool:
        p.join()

This way, you can control how many simultaneous requests take place by limiting the number of threads.
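For example (untested, and only a sketch based on the code in your question), the worker body can wrap the urlfetch call directly. The headers here are a placeholder for whatever my_function actually sends, and catching the DownloadError from your traceback means a failed job shows up in the logs instead of silently going missing.

import logging

from google.appengine.api import urlfetch

# Placeholder headers -- substitute whatever my_function already sends.
HTTP_HEADERS = {'Content-Type': 'application/x-www-form-urlencoded'}

def worker(jobs):
    # Pull (url, payload) jobs off the queue until the None sentinel arrives.
    while True:
        job = jobs.get()
        if job is None:
            break
        url, data = job
        try:
            response = urlfetch.fetch(url=url, payload=data,
                                      method=urlfetch.POST,
                                      headers=HTTP_HEADERS, deadline=60)
            # ... use response.content here ...
        except urlfetch.DownloadError as e:
            # Log the failure so a missing calculation is visible in the logs.
            logging.error('fetch failed for %s: %s', url, e)

The driver code stays the same as above, except that you put (url, data) tuples on the queue instead of bare URLs.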

FYI: Python is not really good at threading (depending on the interpreter). Some interpreters have a Global Interpreter Lock that prevents multiple threads from running at once. Threading works OK for I/O-bound tasks, but not for making efficient use of the CPU. For true parallelism, use multiprocessing. The changes to my (untested) sample code above would be to use multiprocessing instead of threading and create a Process instead of a Thread.
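For reference, a minimal (again untested) sketch of that substitution is below; urls_to_get is a placeholder list standing in for your real jobs. Keep in mind this applies to ordinary Python processes, not to code running inside the App Engine sandbox, which as far as I know does not let you spawn processes.

import multiprocessing

def worker(jobs):
    # Same loop as before: consume jobs until the None sentinel arrives.
    while True:
        url = jobs.get()
        if url is None:
            break
        # do CPU-bound work with the URL here

if __name__ == '__main__':
    process_count = 4

    # Placeholder job list -- replace with your real URLs.
    urls_to_get = ['http://example.com/job/%d' % i for i in range(200)]

    job_q = multiprocessing.Queue()

    pool = [multiprocessing.Process(target=worker, args=(job_q,))
            for i in range(process_count)]
    for p in pool:
        p.start()

    for url in urls_to_get:
        job_q.put(url)

    # One sentinel per process so every worker exits its loop.
    for p in pool:
        job_q.put(None)

    for p in pool:
        p.join()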

Upvotes: 1
