Reputation: 11
My code below doesn't work for New York Times article URLs. If you change the url variable to something else (for example, the commented-out home page URL), it works. Why is that?
import urllib

#url = "http://www.nytimes.com"
url = "http://www.nytimes.com/interactive/2014/07/07/upshot/how-england-italy-and-germany-are-dominating-the-world-cup.html"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
print htmltext
Please advise. Thanks.
Upvotes: 1
Views: 331
Reputation: 1936
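I think NYT validates requests with cookies. If a request doesn't look like an ordinary browser request, the server answers with a redirect (a Location header) instead of the article, so your request never reaches the content.
The solution is simple: use a CookieJar so that cookies set by the server are sent back on the following requests, like this: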
import cookielib, urllib2
# The CookieJar stores cookies the server sets during redirects
cj = cookielib.CookieJar()
# Build an opener that sends the stored cookies back on each request
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
url = "http://www.nytimes.com/interactive/2014/07/07/upshot/how-england-italy-and-germany-are-dominating-the-world-cup.html"
htmlfile = opener.open(url)
htmltext = htmlfile.read()
print htmltext
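For readers on Python 3, where cookielib and urllib2 were renamed to http.cookiejar and urllib.request, the same approach might look roughly like the sketch below. The helper names build_cookie_opener and fetch are made up for illustration, and whether the site still behaves as described in this answer is an assumption.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return an opener that stores and resends cookies, plus its jar.

    Hypothetical helper: the name is not from the original answer.
    """
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch(url, opener):
    """Open the URL with the given opener and return the raw body bytes."""
    with opener.open(url) as response:
        return response.read()
```

Calling `opener, jar = build_cookie_opener()` followed by `fetch(url, opener)` would return the page as bytes. Looking at `len(jar)` afterwards shows how many cookies the server set along the way.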
Upvotes: 2
Reputation: 37023
By "doesn't work" I presume you mean it doesn't give you the expected content. I get an empty result when I access that URL using urllib, so this is likely yet another aspect of the NYT's "paywall."
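One way to check what the server sends back first, before any redirect is followed, is to tell urllib not to follow redirects and then read the status code and Location header. This is only a sketch for Python 3. The names NoRedirect and first_response are hypothetical, and it has not been run against the NYT site.

```python
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Redirect handler that refuses to follow any redirect."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Returning None makes urllib raise HTTPError for the 3xx
        # response instead of requesting the new URL.
        return None


def first_response(url):
    """Return (status_code, location) for the first response only.

    location is None when the server did not send a Location header.
    """
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url) as response:
            return response.getcode(), response.headers.get("Location")
    except urllib.error.HTTPError as err:
        return err.code, err.headers.get("Location")
```

If `first_response(url)` reports a 3xx status together with a Location value, the empty result is more likely caused by redirects that depend on cookies than by the article body itself.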
Upvotes: 0