I am trying to follow the multithreading example given in Python urllib2.urlopen() is slow, need a better way to read several urls, but I seem to get a "thread error", and I am not sure what this really means.
import threading
import Queue
import urllib2
from urllib2 import HTTPError

urlList = [list of urls to be fetched]*100

def read_url(url, queue):
    my_data = []
    try:
        data = urllib2.urlopen(url, None, 15).read()
        print('Fetched %s from %s' % (len(data), url))
        my_data.append(data)
        queue.put(data)
    except HTTPError, e:
        data = urllib2.urlopen(url).read
        print('Fetched %s from %s' % (len(data), url))
        my_data.append(data)
        queue.put(data)

def fetch_parallel():
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, result)) for url in urlList]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

res = fetch_parallel()
reslist = []
while not res.empty: reslist.append(res.get())
print (reslist)
I get the following first error:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "demo.py", line 76, in read_url
    print('Fetched %s from %s' % (len(data), url))
TypeError: object of type 'instancemethod' has no len()
On the other hand, sometimes it does seem to fetch data, but then I get the following second error:
Traceback (most recent call last):
  File "demo.py", line 89, in <module>
    print str(res[0])
AttributeError: Queue instance has no attribute '__getitem__'
When it does fetch data, why is the result not showing up in res? Thanks for your time.
Update: After changing read to read() in the read_url() function, the situation has improved (I now get many page fetches), but I still get the following error:
Exception in thread Thread-86:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 505, in run
    self.__target(*self.__args, **self.__kwargs)
  File "demo.py", line 75, in read_url
    data = urllib2.urlopen(url).read()
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 429, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 605, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python2.7/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 502: Bad Gateway
Note that urllib2 is not thread-safe. Therefore, you should really use urllib3.
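As a rough sketch (the function and variable names are mine, not from your code), the same fan-out pattern could look like this with urllib3's thread-safe PoolManager; the 15-second timeout mirrors your original call:

import threading
import Queue
import urllib3

def read_url(url, pool, queue):
    try:
        # A single PoolManager can be shared safely across threads
        # and reuses connections to the same host.
        r = pool.request('GET', url, timeout=15.0)
        # Note: a 502 still comes back as a normal response here;
        # check r.status if you want to treat HTTP error codes as failures.
        print('Fetched %s from %s' % (len(r.data), url))
        queue.put(r.data)
    except urllib3.exceptions.HTTPError, e:
        print('Failed to fetch %s: %s' % (url, e))

def fetch_parallel(urls):
    pool = urllib3.PoolManager()
    result = Queue.Queue()
    threads = [threading.Thread(target=read_url, args=(url, pool, result))
               for url in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result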
Some of your problems are entirely unrelated to threading. Threads just make the error reporting more complex. Instead of
data = urllib2.urlopen(url).read
you want
data = urllib2.urlopen(url).read()
# ^^
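That missing pair of parentheses is exactly what your first traceback is complaining about: without them, data is bound to the read method object itself rather than to the string it returns, so len(data) fails. To illustrate:

f = urllib2.urlopen(url)
data = f.read    # a bound method object; len(data) raises
                 # TypeError: object of type 'instancemethod' has no len()
data = f.read()  # the response body as a string; len(data) works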
A 502 Bad Gateway error indicates a server misconfiguration (most likely, an internal server of the web service you're connecting to is rebooting or unavailable). There's nothing you can do about it - the URL is just not reachable right now. Use try..except to handle these errors, for example by printing a diagnostic message, scheduling the URL to be retrieved after an appropriate waiting period, or leaving out the failed data set.
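A sketch of what that handling could look like in your read_url (catching URLError as well, since it covers DNS and connection failures; the messages are just examples):

import urllib2
from urllib2 import HTTPError, URLError

def read_url(url, queue):
    try:
        data = urllib2.urlopen(url, None, 15).read()
        print('Fetched %s from %s' % (len(data), url))
        queue.put(data)
    except HTTPError, e:
        # e.g. HTTP Error 502: Bad Gateway -- log it and leave this URL out
        print('Skipping %s: HTTP %d %s' % (url, e.code, e.msg))
    except URLError, e:
        print('Skipping %s: %s' % (url, e.reason))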
To get the values from the queue, you can do the following:
res = fetch_parallel()
reslist = []
while not res.empty():
    reslist.append(res.get_nowait())  # or get, doesn't matter here
print(reslist)
There is also no way around real error handling in case a URL is really unreachable. Simply re-requesting it might work in some cases, but you must be able to handle the case that the remote host is truly unreachable at this time. How you do that depends on your application's logic.
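For example, one possible retry policy (a sketch; the attempt count and delay are arbitrary) is a bounded loop with an increasing wait that gives up after a few failures:

import time
import urllib2
from urllib2 import HTTPError, URLError

def fetch_with_retries(url, attempts=3, delay=5):
    for i in range(attempts):
        try:
            return urllib2.urlopen(url, None, 15).read()
        except (HTTPError, URLError), e:
            print('Attempt %d for %s failed: %s' % (i + 1, url, e))
            time.sleep(delay * (i + 1))  # back off a little more each time
    return None  # truly unreachable; the caller decides what to do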