Reputation: 43
I am trying to get the HTML text from NY Daily News and other websites, but I can't get mechanize to time out correctly. When the timeout is .01 it times out immediately, but when the timeout is something more reasonable (1.0), it runs for about two minutes before giving me this error:
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/monitor.py", line 575, in run
    already_pickled=True)
  File "/usr/lib/python2.7/dist-packages/spyderlib/utils/bsdsocket.py", line 24, in write_packet
    sock.send(struct.pack("l", len(sent_data)) + sent_data)
error: [Errno 32] Broken pipe
import mechanize

br = mechanize.Browser()
url = 'http://www.nydailynews.com/services/feeds'
htmltext = br.open(url, timeout=1.0).read()
print htmltext[:200]
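One thing worth keeping in mind (this is about socket timeout semantics in general, not mechanize specifically): the timeout you pass ultimately becomes a socket-level timeout, which applies to each blocking operation (the connect, each individual read) rather than to the whole request. A slow server that dribbles out data can therefore keep a request alive far longer than the nominal timeout. A minimal offline illustration with a raw socket:

```python
import socket
import time

# A local "server" that accepts a connection into its backlog but never
# sends anything, so the client's recv() blocks until it times out.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.create_connection(server.getsockname(), timeout=5)
client.settimeout(0.2)  # applies to EACH blocking call, not the whole request

start = time.time()
try:
    client.recv(1024)   # blocks: the server sends nothing
except socket.timeout:
    elapsed = time.time() - start
    print("recv timed out after %.2fs" % elapsed)  # roughly 0.2s

client.close()
server.close()
```

Each successful recv() resets the clock, so a request made of many small reads can run well past the timeout in total.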
Upvotes: 1
Views: 1984
Reputation: 28036
There's something goofy going on with the way urllib2 works in general (mechanize uses a fork of it). Take a look at this:
#!/usr/bin/python
import time
import urllib2

def graburl(url, timeout):
    urllib2.urlopen(url, timeout=float(timeout))

for i in range(1, 30):
    try:
        start = time.time()
        graburl("http://www.cnn.com:443", i)
    except:
        print 'Timeout: ', i, 'Duration: ', time.time() - start
When run:
Timeout: 1 Duration: 4.45208692551
Timeout: 2 Duration: 8.00451898575
Timeout: 3 Duration: 12.0053498745
Timeout: 4 Duration: 16.0044560432
Timeout: 5 Duration: 20.0762069225
Timeout: 6 Duration: 24.005065918
So the actual timeout ends up being 4x the timeout specified.
Note that in this specific case the socket connection itself succeeds; the read on it just never completes. (Or the request isn't serviced in a reasonable amount of time...)
If anyone can come up with a good reason why the timeout is multiplied by four, I'd be very interested in what causes that.
Tested with Python 2.7.5 on OS X Mavericks.
Using socket.setdefaulttimeout() doesn't seem to change this behavior.
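For what it's worth, one mechanism that can multiply a timeout by a small integer (whether it's the culprit in this particular case is an assumption, not something I've verified): socket.create_connection tries each address that getaddrinfo returns for the host, and each attempt gets the full timeout. A host that resolves to four addresses can therefore burn up to four times the requested timeout on the connect path alone. A rough offline simulation of that loop (create_connection_sim and connect_attempt are stand-ins, not real library calls):

```python
import socket
import time

def create_connection_sim(addresses, timeout):
    """Sketch of socket.create_connection's retry loop: try each
    resolved address in turn, giving EACH attempt the full timeout."""
    start = time.time()
    for addr in addresses:
        try:
            connect_attempt(addr, timeout)
            return time.time() - start
        except socket.timeout:
            continue  # next address; the timeout budget resets
    return time.time() - start

def connect_attempt(addr, timeout):
    # Stand-in for a connect() to an unresponsive address: it burns
    # the whole timeout and then fails.
    time.sleep(timeout)
    raise socket.timeout("timed out connecting to %s" % addr)

# Hypothetical host resolving to four unreachable addresses.
fake_addresses = ["203.0.113.%d" % i for i in range(1, 5)]
elapsed = create_connection_sim(fake_addresses, 0.05)
print("requested timeout 0.05s, total elapsed %.2fs" % elapsed)  # ~4 x 0.05
```

That said, the observation above was that the connect succeeds, so the multiplication may instead be happening somewhere in the read/retry path.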
Upvotes: 1
Reputation: 12316
Those links take a long time to load even in a browser. Within Python I was able to load the subset http://feeds.nydailynews.com/nydnrss/sports in about 16 seconds (without specifying a timeout).
I think you'd need to set the timeout to something even more "reasonable" than one second to give the page a chance to load, and I'd choose a more focused feed than the main page where they're all listed. This top-stories feed loads successfully for me with timeout=1: http://feeds.nydailynews.com/nydnrss
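If you want to fall back from the heavy listing page to a focused feed automatically, one way to structure it is to try each URL in order and keep the first one that loads within your deadline. The sketch below keeps the fetch function injectable so it runs offline here; first_feed_within and fake_fetch are hypothetical names, and in real use you'd pass something like lambda url, t: br.open(url, timeout=t).read():

```python
import time

def first_feed_within(urls, timeout, fetch):
    """Try each feed URL in order; return the first (url, data) pair
    that loads within `timeout` seconds, or (None, None)."""
    for url in urls:
        start = time.time()
        try:
            data = fetch(url, timeout)
        except Exception:
            continue  # timed out or failed: try the next, more focused feed
        if time.time() - start <= timeout:
            return url, data
    return None, None

# Offline stand-in: the focused feed answers, the full listing page hangs.
def fake_fetch(url, timeout):
    if "nydnrss" in url:
        return "<rss>top stories</rss>"
    raise IOError("timed out")

url, data = first_feed_within(
    ["http://www.nydailynews.com/services/feeds",
     "http://feeds.nydailynews.com/nydnrss"],
    1.0, fake_fetch)
print("loaded: %s" % url)
```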
Upvotes: 0