Fal-Cone

Reputation: 345

urllib2 returning no HTML

Trying to spider/crawl through a third-party website, but I seem to have hit a snag:

Calling urlopen on a site gets a response, but reading and printing the HTML shows that I'm getting nothing back. Could this be due to some kind of blocking on the other end, or something else?

Currently, I'm trying to open New York Times articles. The main pages return HTML; the article pages don't.

import urllib2

try:
    source = urllib2.urlopen(target_site)
    html = source.read()
    print "HTML: ", html.lower()
except urllib2.URLError, e:
    print "failed to open: ", e

output:

HTML:
(other stuff)

Oh, and it also times out once in a while, but that's a different story, I'm hoping.

Upvotes: 0

Views: 1315

Answers (3)

Javier

Reputation: 116

To anybody else running into this issue with urllib2: the problem might also be that you are only getting back a meta tag with a redirect. You can confirm this by printing the result of opening the URL and calling read() on it:

<meta http-equiv="refresh" content="0;url=http://www.yourURL.com.mx/ads.txt"/>

First check that you are properly saving cookies into the jar, then take a look at this link: how to follow meta refreshes in Python
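As a rough sketch of the detection step (the function name and regex are my own, not from the linked post; a real crawler should use an HTML parser), you can scan the returned HTML for a meta refresh target:

```python
import re

def meta_refresh_url(html):
    """Extract the target URL from a <meta http-equiv="refresh"> tag.

    Returns None when the page has no meta refresh. The regex is a
    simple sketch, not a robust HTML parser.
    """
    match = re.search(
        r'<meta[^>]*http-equiv=["\']refresh["\'][^>]*content=["\'][^"\']*url=([^"\'>]+)',
        html, re.IGNORECASE)
    return match.group(1) if match else None
```

If this returns a URL instead of None, you are being redirected rather than served the page you asked for.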

Upvotes: 0

Stephen

Reputation: 2424

The site could be refusing you the page because you don't have an appropriate user-agent in the header. This post tells you how to set one.

Try this if it is the case:

import urllib2

try:
    req = urllib2.Request(target_site)
    req.add_header("User-Agent", "Mozilla/5.0")
    source = urllib2.urlopen(req)
    html = source.read()
    print "HTML: ", html.lower()
except urllib2.URLError, e:
    print "failed to open: ", e

Scratch that. That's not the problem for the New York Times articles. It's because nytimes.com tries to give you cookies but can't, which causes a redirect loop. You need to create a custom URL opener that can handle cookies. You can do that like so:

import urllib2

# make a url opener that can handle cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
# read in the site
response = opener.open(target_site)
html = response.read()

To verify that it is the right article, you can write the HTML out to a file and open it in a web browser.
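For anyone on Python 3, where urllib2 and cookielib became urllib.request and http.cookiejar, the equivalent cookie-aware opener looks like this (the explicit CookieJar and the User-Agent line are additions of mine, in case the site also checks that header):

```python
import urllib.request
import http.cookiejar

# Python 3 equivalent of the urllib2 opener above: an explicit cookie
# jar wired into the opener via HTTPCookieProcessor.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
# A browser-like User-Agent, in case the site also checks that.
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

# html = opener.open(target_site).read()  # the actual fetch (network call)
```

Cookies set by the server during `opener.open(...)` land in `jar` and are sent back on subsequent requests through the same opener, which breaks the redirect loop.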

Upvotes: 3

user1074057

Reputation: 1812

I thought I would add a plug for requests, which makes this relatively easy. After easy_install requests or pip install requests:

import requests

page = requests.get(page_url)
html = page.content

Edit: I saw the URL posted in the comments to the question and thought I would confirm that requests.get does work with that page.
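Since the asker also mentions occasional timeouts, here is a sketch that wraps the same call with a User-Agent and a timeout (the function name and the 10-second value are my own choices, not anything from requests itself):

```python
import requests

def fetch_article(url, timeout=10):
    """Fetch a page with a browser-like User-Agent and a timeout.

    The timeout keeps a slow article page from hanging the crawler;
    a too-slow response raises requests.exceptions.Timeout instead.
    """
    headers = {"User-Agent": "Mozilla/5.0"}
    page = requests.get(url, headers=headers, timeout=timeout)
    page.raise_for_status()  # turn 4xx/5xx responses into an exception
    return page.text
```

requests follows redirects and keeps cookies within a request by default, which is why it works on pages where plain urlopen returns nothing.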

Upvotes: 0
