Reputation: 345
Trying to spider/crawl through a third-party website, but I seem to have hit a snag:
urlopen'ing a site gets a response, but reading and printing the HTML tells me that I'm getting nothing back. Could this be due to some kind of blocking on the other end, or something else?
Currently, I'm trying to open New York Times articles. The main pages return HTML; the articles, uh, don't.
import urllib

try:
    source = urllib.urlopen(target_site)
    html = source.read()
    print "HTML: ", html.lower()
except IOError:
    print "request failed"
output:
HTML:
(other stuff)
Oh, and it also times out once in a while, but that's a different story, I'm hoping.
Upvotes: 0
Views: 1315
Reputation: 116
To anybody else running into this issue when using urllib2: the problem might also be that you are only getting back a meta tag that starts a redirect chain. You can confirm this by opening the URL, calling read() on the response, and printing the result:
<meta http-equiv="refresh" content="0;url=http://www.yourURL.com.mx/ads.txt"/>
Check first to see that you are properly saving cookies into the jar, then take a look at this link: how to follow meta refreshes in Python
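A rough sketch of both steps, assuming target_site holds the article URL and that the redirect target shows up in a url=... attribute like the tag above (the regex is a quick-and-dirty match, not a real HTML parser):

import re
import cookielib
import urllib2

# keep cookies across requests, then follow one meta refresh
cookie_jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

html = opener.open(target_site).read()

# if all we got back is a meta refresh tag, follow its url= target once
match = re.search(r'http-equiv="refresh"[^>]*url=([^"]+)', html, re.IGNORECASE)
if match:
    html = opener.open(match.group(1)).read()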
Upvotes: 0
Reputation: 2424
This is not the problem for the New York Times article, but in general a site could be refusing to serve you the page because you don't send an appropriate User-Agent header. This post tells you how to do it.
Try this if that is the case:
import urllib2

try:
    req = urllib2.Request(target_site)
    req.add_header("User-Agent", "Mozilla/5.0")
    source = urllib2.urlopen(req)  # urllib2, not urllib, accepts Request objects
    html = source.read()
    print "HTML: ", html.lower()
except urllib2.URLError:
    print "request failed"
Scratch that. That's not the problem for the New York Times articles. It's because nytimes.com tries to set cookies but can't, which causes a redirect loop. You need to create a custom URL opener that can handle cookies. You can do that like this:
import urllib2

# make a URL opener that can handle cookies
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
# read in the site
response = opener.open(target_site)
html = response.read()
To verify that it is the right article, you can write the HTML out to a file and open it in a web browser.
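For example, a quick way to do that check (assuming html holds what response.read() returned; the filename is arbitrary):

# dump the fetched HTML to a file, then open that file in a browser
with open("article.html", "w") as f:
    f.write(html)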
Upvotes: 3
Reputation: 1812
I thought I would add a plug for requests. It can do this relatively easily. After easy_install requests or pip install requests:
import requests

page = requests.get(page_url)
html = page.content  # raw bytes of the response body
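If a site insists on a browser-like User-Agent or on cookies (as discussed above), requests can handle that too. A minimal sketch, where the header value and the status check are only illustrative:

import requests

# a Session keeps cookies between requests, similar to the urllib2 cookie opener above
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"

page = session.get(page_url)
page.raise_for_status()  # raise an exception on HTTP error codes
html = page.content      # raw bytes; page.text gives the decoded string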
Edit: I saw the URL posted in the comments to the question and thought I would confirm that requests.get does work with that page.
Upvotes: 0