Victor

Reputation: 3

Why can't I access the HTML of some websites

So I've been trying to learn how to extract data from websites efficiently using Python. Ideally I'd like to gather stats more efficiently than I currently do from www.transfermarkt.com, a football site, but for some reason this site behaves differently from every other site I've tried. Even the simple code below gives me basically no response. Can anybody explain why I can't get the HTML of this website when I can with other websites?

import urllib
htmlfile = urllib.urlopen("http://www.transfermarkt.com")
htmltext = htmlfile.read()
print (htmltext)

Upvotes: 0

Views: 1434

Answers (2)

fixmycode

Reputation: 8506

From the urllib#urlopen documentation:

One caveat: the read() method, if the size argument is omitted or negative, may not read until the end of the data stream; there is no good way to determine that the entire stream from a socket has been read in the general case.

If you check the response headers for the site you're trying to read, you'll see that there's no Content-Length header. This is because the transfer is chunked, and you need to read all those chunks before you have the full content.

htmltext = ""
data = htmlfile.read(512)
while data:  # read() returns an empty string at the end of the stream
    htmltext += data
    data = htmlfile.read(512)

Combine this with the User-Agent fix that Aswin pointed out.
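The read-until-empty loop can be exercised offline against any file-like object; a minimal sketch using `io.BytesIO` as a stand-in for the socket response (the `read_all` helper name is mine, not part of either answer):

```python
import io

def read_all(stream, chunk_size=512):
    """Accumulate chunks until read() returns an empty result (EOF)."""
    parts = []
    data = stream.read(chunk_size)
    while data:  # an empty bytes object signals end of stream
        parts.append(data)
        data = stream.read(chunk_size)
    return b"".join(parts)

fake = io.BytesIO(b"x" * 1500)  # stand-in for the chunked HTTP response
print(len(read_all(fake)))      # → 1500
```

The same loop works on a real response object, since it only relies on `read(size)` returning an empty result at the end of the stream.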

Upvotes: 1

Aswin Kumar K P

Reputation: 1102

The site you specified blocks robots via http://www.transfermarkt.com/robots.txt, so you have to access it with a User-Agent header that identifies you as a browser.

So basically your code should be:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open("http://www.transfermarkt.com")
print (response.read())
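For reference, the same User-Agent trick on Python 3, where `urllib2` was merged into `urllib.request` (a sketch; the `Mozilla/5.0` value is the same browser stand-in used above):

```python
import urllib.request

# Python 3 port of the urllib2 snippet: build_opener/addheaders
# becomes a Request object carrying a headers dict.
req = urllib.request.Request(
    "http://www.transfermarkt.com",
    headers={"User-Agent": "Mozilla/5.0"},  # identify as a browser
)
print(req.get_header("User-agent"))  # the header urlopen() will send
```

Passing `req` to `urllib.request.urlopen(req)` then performs the actual fetch with that header attached.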

Upvotes: 4

Related Questions