user3237941

Reputation: 11

urllib.urlopen does not work for this URL, though mechanize works

My code below doesn't work for New York Times URLs that point to articles. If you change the url variable to something else (for example, the commented-out home page), it works. Why is that?

import urllib

#url = "http://www.nytimes.com"
url = "http://www.nytimes.com/interactive/2014/07/07/upshot/how-england-italy-and-germany-are-dominating-the-world-cup.html"
htmlfile = urllib.urlopen(url)
htmltext = htmlfile.read()
print htmltext

Please advise. Thanks.

Upvotes: 1

Views: 331

Answers (2)

taggon

Reputation: 1936

I think NYT validates requests with cookies. If a request doesn't look like an ordinary browser request, the server responds with a redirect (a Location header) instead of the page, and a client that doesn't keep cookies never reaches the article.

The solution is simple. Use a CookieJar like this:

import cookielib, urllib2

# The jar stores cookies the server sets and sends them back on later requests
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

url = "http://www.nytimes.com/interactive/2014/07/07/upshot/how-england-italy-and-germany-are-dominating-the-world-cup.html"
# Open through the cookie-aware opener rather than urllib.urlopen
htmlfile = opener.open(url)
htmltext = htmlfile.read()

print htmltext
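
For anyone on Python 3, here is a minimal sketch of the same cookie-jar idea, assuming the only change needed is the module renames (cookielib became http.cookiejar, urllib2 became urllib.request). The function names build_cookie_opener and fetch_html are made up for illustration, and I haven't verified how the NYT site responds to it today.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar). The opener stores cookies in jar and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_html(url, opener, timeout=10):
    """Fetch url through the given opener and return the body as text."""
    with opener.open(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

To use it, call opener, jar = build_cookie_opener() and then print(fetch_html(url, opener)) with the url from the question.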

Upvotes: 2

holdenweb

Reputation: 37023

By "doesn't work" I presume you mean it doesn't give you the expected content. I get an empty result when I access that URL using urllib, so this is likely yet another aspect of the NYT's "paywall."
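
To pin down what "doesn't work" means for a given URL, a rough Python 3 sketch like the one below can report whether the body came back empty and whether the request ended up at a different URL. The helper name describe_fetch and the keys of the returned dict are my own invention, not part of urllib.

```python
import urllib.request


def describe_fetch(url, opener=None, timeout=10):
    """Fetch url and summarize the response instead of printing the whole body."""
    if opener is None:
        opener = urllib.request.build_opener()
    with opener.open(url, timeout=timeout) as response:
        body = response.read()
        final_url = response.geturl()
        return {
            # Not every response type exposes a status, so default to None
            "status": getattr(response, "status", None),
            "final_url": final_url,
            # True when redirects led somewhere other than the requested URL
            "redirected": final_url != url,
            "length": len(body),
        }
```

A length of 0 would match the empty result described above. Passing a cookie-aware opener as the second argument lets you compare the two cases side by side.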

Upvotes: 0
