Reputation: 35
I'm trying to write a program that will (among other things) get text or source code from a predetermined website. I'm learning Python to do this, and most sources have told me to use urllib2. Just as a test, I tried this code:
import urllib2
response = urllib2.urlopen('http://www.python.org')
html = response.read()
Instead of acting in any expected way, the shell just sits there, like it's waiting for some input. There isn't even a ">>>" or "...". The only way to exit this state is with [Ctrl]+C. When I do this, I get a whole bunch of error messages, like
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 124, in urlopen
return _opener.open(url, data)
File "/m/mls/pkg/ix86-Linux-RHEL5/lib/python2.5/urllib2.py", line 381, in open
response = self._open(req, data)
I'd appreciate any feedback. Is there a different tool than urllib2 I should use, or can you give advice on how to fix this? I'm using a networked computer at work, and I'm not entirely sure how the shell is configured or how that might affect anything.
Upvotes: 2
Views: 6105
Reputation: 1823
With 99.999% probability, it's a proxy issue. Python is incredibly bad at detecting the right HTTP proxy to use, and when it cannot find the right one, it just hangs and eventually times out.
So first you have to find out which proxy should be used; check the options of your browser (Tools -> Internet Options -> Connections -> LAN Setup... in IE, etc.). If it's using a script to autoconfigure, you'll have to fetch the script (which should be some sort of JavaScript) and find out where your request is supposed to go. If there is no script specified and the "automatically determine" option is ticked, you might as well just ask some IT guy at your company.
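As a quick sanity check, you can also ask Python which proxies it detects on its own. A minimal sketch; urllib.getproxies() reads the http_proxy and related environment variables (and the registry on Windows):
import urllib
# Print the proxy settings Python picks up from the environment;
# an empty dict means it found none and will try to connect directly.
print(urllib.getproxies())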
I assume you're using Python 2.x. From the Python docs on urllib:
import urllib
# Use http://www.someproxy.com:3128 for http proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)
Note that ProxyHandler figuring out proxy settings on its own (its default behavior) is exactly what already happens when you use urlopen, so relying on autodetection is probably not going to work here.
If you really want urllib2, you'll have to specify a ProxyHandler, like the example on that page. Authentication might or might not be required (usually it's not).
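For reference, here's a minimal sketch of that approach; the proxy host and port are placeholders, so substitute whatever you found in your browser settings:
import urllib2
# Point the handler at your actual proxy (placeholder address shown here)
proxy_handler = urllib2.ProxyHandler({'http': 'http://www.someproxy.com:3128'})
opener = urllib2.build_opener(proxy_handler)
urllib2.install_opener(opener)  # all later urlopen calls go through the proxy
response = urllib2.urlopen('http://www.python.org')
html = response.read()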
Upvotes: 4
Reputation: 22841
This isn't a good answer to "how to do this with urllib2", but let me suggest python-requests. The whole reason it exists is that the author found urllib2 to be an unwieldy mess. And he's probably right.
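For illustration, the same fetch with requests (assuming you've installed the package, e.g. with pip install requests) looks like this:
import requests
# requests picks up http_proxy/https_proxy environment variables automatically
response = requests.get('http://www.python.org', timeout=5)
html = response.text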
Upvotes: 3
Reputation: 67073
I get a 404 error almost immediately (no hanging):
>>> import urllib2
>>> response = urllib2.urlopen('http://www.python.org/fish.html')
Traceback (most recent call last):
...
urllib2.HTTPError: HTTP Error 404: Not Found
If I try to contact an address that doesn't have an HTTP server running, it hangs for quite a while until the timeout happens. You can shorten the wait by passing the timeout parameter to urlopen:
>>> response = urllib2.urlopen('http://cs.princeton.edu/fish.html', timeout=5)
Traceback (most recent call last):
...
urllib2.URLError: <urlopen error timed out>
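One caveat: your traceback shows Python 2.5, and urlopen only grew its timeout parameter in Python 2.6. On 2.5 you can get the same effect with a global socket timeout (a sketch using the standard socket module):
import socket
import urllib2
socket.setdefaulttimeout(5)  # seconds; applies to all new socket connections
response = urllib2.urlopen('http://www.python.org')
html = response.read()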
Upvotes: 0
Reputation: 1406
That is very weird; have you tried a different URL?
Otherwise there is httplib, though it is more complicated. Here's your example using httplib:
import httplib as h
# httplib connects straight to the host, bypassing any proxy settings,
# which can help you tell whether the hang is proxy-related
conn = h.HTTPConnection('www.python.org')
conn.connect()
conn.request('GET', '/')
response = conn.getresponse()
if response.status == h.OK:
    html = response.read()
Upvotes: 0