djangojack

Reputation: 43

Python Mechanize timeout issues

I am trying to get the HTML text from NY Daily News and other websites, but I can't get mechanize to time out correctly. When the timeout is .01 it times out immediately, but when the timeout is something more reasonable (1.0), it runs for ~2 minutes before giving me this error:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/monitor.py", line 575, in run
    already_pickled=True)
  File "/usr/lib/python2.7/dist-packages/spyderlib/utils/bsdsocket.py", line 24, in write_packet
    sock.send(struct.pack("l", len(sent_data)) + sent_data)
error: [Errno 32] Broken pipe

Here is my code:

import mechanize

br = mechanize.Browser()    
url = 'http://www.nydailynews.com/services/feeds'
htmltext = br.open(url, timeout=1.0).read()
print htmltext[:200]
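
To show how long the call actually runs before failing, here is the same request wrapped in a rough timing check (just a sketch; it catches Exception broadly only to print whatever mechanize raises together with the real elapsed time):

import time
import mechanize

br = mechanize.Browser()
url = 'http://www.nydailynews.com/services/feeds'

start = time.time()
try:
    htmltext = br.open(url, timeout=1.0).read()
    print htmltext[:200]
except Exception as e:
    # Report whatever mechanize raised plus the real elapsed time,
    # so the effective timeout is visible.
    print 'Failed after %.1fs: %r' % (time.time() - start, e)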

Upvotes: 1

Views: 1984

Answers (2)

synthesizerpatel

Reputation: 28036

There's something goofy going on with the way urllib2 handles timeouts in general (mechanize uses a fork of it).

Take a look at this:

#!/usr/bin/python

import time
import urllib2

def graburl(url, timeout):
    # urlopen() takes a per-socket timeout in seconds
    urllib2.urlopen(url, timeout=float(timeout))

for i in range(1, 30):
    try:
        start = time.time()
        # Plain HTTP against port 443: the TCP connect succeeds, but the
        # server never sends a usable response, so the read stalls.
        graburl("http://www.cnn.com:443", i)
    except:
        print 'Timeout: ', i, 'Duration: ', time.time() - start

When run:

Timeout:  1 Duration:  4.45208692551
Timeout:  2 Duration:  8.00451898575
Timeout:  3 Duration:  12.0053498745
Timeout:  4 Duration:  16.0044560432
Timeout:  5 Duration:  20.0762069225
Timeout:  6 Duration:  24.005065918

So the actual timeout ends up being 4x the timeout specified.

Note that in this specific case the connection to the socket succeeds, but the data can't be read correctly. (Or the request isn't serviced in a reasonable amount of time...)

If anyone can come up with a good reason why the timeout is multiplied by four, I'd be very interested in what causes that.

Tested with Python 2.7.5 on OS X Mavericks.

Using socket.setdefaulttimeout() doesn't seem to change this behavior.
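
For what it's worth, a small stdlib-only sketch like the one below (hypothetical, not part of the test above) times the DNS lookup, the TCP connect, and the read separately, which helps narrow down where the extra time goes. One thing to keep in mind is that socket.create_connection() applies the timeout to each address getaddrinfo() returns, so a host with several DNS records can spend a multiple of the nominal timeout just on connecting; whether that explains the 4x here is unconfirmed, since the connect itself appeared to succeed.

#!/usr/bin/python
# Diagnostic sketch (assumes Python 2.7, stdlib only): time each phase
# of a request to the same host/port used in the test above.

import socket
import time

HOST, PORT, TIMEOUT = 'www.cnn.com', 443, 1.0

start = time.time()
infos = socket.getaddrinfo(HOST, PORT, 0, socket.SOCK_STREAM)
print 'getaddrinfo: %d addresses in %.2fs' % (len(infos), time.time() - start)

start = time.time()
try:
    # create_connection() applies the timeout per resolved address.
    sock = socket.create_connection((HOST, PORT), timeout=TIMEOUT)
    print 'connected in %.2fs' % (time.time() - start)

    start = time.time()
    sock.sendall('GET / HTTP/1.0\r\nHost: %s\r\n\r\n' % HOST)
    data = sock.recv(4096)  # plain HTTP against an HTTPS port: likely stalls
    print 'read %d bytes in %.2fs' % (len(data), time.time() - start)
except socket.error as e:
    # socket.timeout is a subclass of socket.error, so this covers timeouts too
    print 'failed after %.2fs: %s' % (time.time() - start, e)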

Upvotes: 1

beroe

Reputation: 12316

Those links take a long time to load even in a browser. Within Python I was able to load the sports subset http://feeds.nydailynews.com/nydnrss/sports in about 16 seconds (without specifying a timeout).

I think you would need to set the timeout to something even more "reasonable" than one second to give it a chance to load, and I would choose a more focused feed than the main page where they are all listed. This top-stories link loads successfully for me with timeout=1 (see the sketch below): http://feeds.nydailynews.com/nydnrss
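
For reference, here is a minimal sketch of that suggestion, reusing the code from the question but pointed at the narrower top-stories feed:

import mechanize

br = mechanize.Browser()
# The narrower top-stories feed suggested above; in my test it loads
# within a one-second timeout, unlike the full feeds index page.
url = 'http://feeds.nydailynews.com/nydnrss'
htmltext = br.open(url, timeout=1.0).read()
print htmltext[:200]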

Upvotes: 0
