Reputation:
I'm making data-scraping calls with urllib2, but each one takes around 1 second to complete. I wanted to test whether I could split the URL-call loop across multiple threads, each with a different offset.
I'm doing this now with my update_items() method, whose first and second parameters are the offset and limit for the loop:
import threading
t1 = threading.Thread(target=trade.update_items(1, 100))
t2 = threading.Thread(target=trade.update_items(101, 200))
t3 = threading.Thread(target=trade.update_items(201, 300))
t1.start()
t2.start()
t3.start()
#t1.join()
#t2.join()
#t3.join()
As the code shows, I tried commenting out the join() calls so that the threads would not wait, but it seems I've misunderstood this library. I inserted print() calls into update_items(), and oddly they show that it still loops serially rather than running all 3 threads in parallel, as I wanted.
My normal scraping run takes about 5 hours to complete. It fetches only very small pieces of data, but each HTTP call takes some time. I want to split this task across at least a few threads to shorten the fetching to around 30-45 minutes.
Upvotes: 0
Views: 1499
Reputation: 414179
To fetch multiple URLs in parallel, limiting to 20 connections at a time:
import urllib2
from multiprocessing.dummy import Pool

def generate_urls(): # generate some dummy urls
    for i in range(100):
        yield 'http://example.com?param=%d' % i

def get_url(url):
    try:
        return url, urllib2.urlopen(url).read(), None
    except EnvironmentError as e:
        return url, None, e

pool = Pool(20) # limit number of concurrent connections
for url, result, error in pool.imap_unordered(get_url, generate_urls()):
    if error is None:
        print result,
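To get a feel for the speed-up before pointing this at a real server, the same pool pattern can be tried with a stand-in for the HTTP call. This is only a rough sketch: fake_fetch, the 0.05 second delay, and the pool size of 10 are made-up values for illustration, not part of the code above.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed Pool

def fake_fetch(item_id):
    # Hypothetical stand-in for urllib2.urlopen(url).read():
    # sleep briefly to imitate network latency.
    time.sleep(0.05)
    return item_id, "payload-%d" % item_id

pool = Pool(10)  # assumed pool size, chosen only for this demo
started = time.time()
# imap_unordered yields (item_id, payload) pairs as workers finish
fetched = dict(pool.imap_unordered(fake_fetch, range(40)))
elapsed = time.time() - started
pool.close()
pool.join()

print("%d items in %.2f s" % (len(fetched), elapsed))
```

Run serially, 40 calls at 0.05 seconds each would take about 2 seconds; with 10 workers the waits overlap, so the total should come out at a fraction of that.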
Upvotes: 3
Reputation: 98469
Paul Seeb has correctly diagnosed your issue.
You are calling trade.update_items, and then passing the result to the threading.Thread constructor. Thus, you get serial behavior: your threads don't do any work, and the creation of each one is delayed until the update_items call returns.
The correct form is threading.Thread(target=trade.update_items, args=(1, 100)) for the first line, and similarly for the later ones. This will pass the update_items function as the thread entry point, and (1, 100) as its positional arguments.
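As a rough illustration of the difference, here is a minimal sketch that uses a sleeping stand-in instead of the real trade.update_items. The update_items function below, its results parameter, and the 0.2 second delay are assumptions made up for the demo.

```python
import threading
import time

def update_items(offset, limit, results):
    # Hypothetical stand-in for trade.update_items:
    # sleep instead of making HTTP calls, then record the range handled.
    time.sleep(0.2)
    results.append((offset, limit))

results = []
# Pass the function itself as target and its arguments via args;
# writing target=update_items(1, 100, results) would run it right here.
threads = [
    threading.Thread(target=update_items, args=(start, start + 99, results))
    for start in (1, 101, 201)
]

started = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()  # only waits for completion; the threads are already running
elapsed = time.time() - started

print(sorted(results))
print("elapsed: %.2f s" % elapsed)
```

Because every thread is started before the first join(), the three sleeps overlap and the elapsed time should be close to one 0.2 second delay rather than three. The join() calls are still worth keeping whenever the main program needs the results before it continues.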
Upvotes: 2