Victor

Reputation: 3

Why can't I access the HTML of some websites

So I've been trying to learn how to extract data from websites efficiently using Python. Ideally I'd like to gather stats more efficiently than I currently do from www.transfermarkt.com, a football site, but for some reason this site behaves differently from every other site I've tried. Even the simple code below gives me basically no response. Can anybody explain why I can't get the HTML of this website when I can with other websites?

import urllib
htmlfile = urllib.urlopen("http://www.transfermarkt.com")
htmltext = htmlfile.read()
print (htmltext)

Upvotes: 0

Views: 1434

Answers (2)

fixmycode

Reputation: 8506

From the urllib#urlopen documentation:

One caveat: the read() method, if the size argument is omitted or negative, may not read until the end of the data stream; there is no good way to determine that the entire stream from a socket has been read in the general case.

If you check the response headers for the site you're trying to read, you'll see that there's no Content-Length header. This is because the transfer is chunked, and you need to read all those chunks before you have the full content.

htmltext = ""
data = htmlfile.read(512)
while data:  # read() returns an empty string at the end of the stream
    htmltext += data
    data = htmlfile.read(512)

Combine this with the User-Agent fix that Aswin pointed out.
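The read-until-empty loop can be exercised offline against any file-like object; a minimal sketch using `io.BytesIO` as a stand-in for the socket response (the `read_all` helper name is mine, not part of either answer):

```python
import io

def read_all(stream, chunk_size=512):
    """Accumulate chunks until read() returns an empty result (EOF)."""
    parts = []
    data = stream.read(chunk_size)
    while data:  # an empty bytes object signals end of stream
        parts.append(data)
        data = stream.read(chunk_size)
    return b"".join(parts)

fake = io.BytesIO(b"x" * 1500)  # stand-in for the chunked HTTP response
print(len(read_all(fake)))      # → 1500
```

The same loop works on a real response object, since it only relies on `read(size)` returning an empty result at the end of the stream.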

Upvotes: 1

Aswin Kumar K P

Reputation: 1102

The site you specified blocks robots via http://www.transfermarkt.com/robots.txt, so you have to access it with a User-Agent header that identifies you as a browser.

So basically your code should be:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open("http://www.transfermarkt.com")
print (response.read())
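For reference, the same User-Agent trick on Python 3, where `urllib2` was merged into `urllib.request` (a sketch; the `Mozilla/5.0` value is the same browser stand-in used above):

```python
import urllib.request

# Python 3 port of the urllib2 snippet: build_opener/addheaders
# becomes a Request object carrying a headers dict.
req = urllib.request.Request(
    "http://www.transfermarkt.com",
    headers={"User-Agent": "Mozilla/5.0"},  # identify as a browser
)
print(req.get_header("User-agent"))  # the header urlopen() will send
```

Passing `req` to `urllib.request.urlopen(req)` then performs the actual fetch with that header attached.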

Upvotes: 4

Related Questions