Reputation: 3
So I've been trying to learn how to extract data from websites efficiently using Python. Ideally I'd like to gather stats from www.transfermarkt.com, a football website, more efficiently than I currently do, but for some reason the site seems to behave differently from every other site I've tried. Even the simple code below gives me basically no response. Can anybody explain why I can't get the HTML of this website when I can with other websites?
import urllib
htmlfile = urllib.urlopen("http://www.transfermarkt.com")
htmltext = htmlfile.read()
print (htmltext)
Upvotes: 0
Views: 1434
Reputation: 8506
From the urllib.urlopen documentation:
One caveat: the read() method, if the size argument is omitted or negative, may not read until the end of the data stream; there is no good way to determine that the entire stream from a socket has been read in the general case.
If you check the response headers for the site you're trying to read, you'll see that there's no Content-Length header. That's because the transfer is chunked, and you need to read all those chunks before you have the full content.
htmltext = ""
data = htmlfile.read(512)
while data:  # read() returns an empty string at end of stream
    htmltext += data
    data = htmlfile.read(512)
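A minimal, self-contained sketch of the same chunked-read loop; here io.BytesIO stands in for the socket response (an assumption, just so the sketch runs without network access), and the loop ends when read() returns an empty result:

```python
import io

# Stand-in for the socket response; io.BytesIO lets the sketch run offline.
htmlfile = io.BytesIO(b"<html>" + b"x" * 2000 + b"</html>")

htmltext = b""
data = htmlfile.read(512)
while data:  # read() returns b"" at end of stream, which is falsy
    htmltext += data
    data = htmlfile.read(512)

print(len(htmltext))  # total bytes accumulated across all chunks
```

The key point is that the loop tests for an empty result rather than None, and re-reads inside the body, so it terminates once the stream is exhausted.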
And apply the user-agent fix that Aswin pointed out.
Upvotes: 1
Reputation: 1102
The site you specified blocks robots in http://www.transfermarkt.com/robots.txt, so you have to send a browser-like User-Agent header with your request.
So basically your code should be:
import urllib2

opener = urllib2.build_opener()
# Send a browser-like User-Agent so the site doesn't reject the request
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
response = opener.open("http://www.transfermarkt.com")
print (response.read())
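If you're on Python 3, urllib2 no longer exists; it was merged into urllib.request. A sketch of the equivalent approach, setting the same header on a Request object (here we only build the request and show the header is set, since actually fetching the page needs network access):

```python
import urllib.request

# Build a request with a browser-like User-Agent header
req = urllib.request.Request(
    "http://www.transfermarkt.com",
    headers={"User-Agent": "Mozilla/5.0"},
)

# urllib.request.urlopen(req).read() would then fetch the page.
# Request stores header names in capitalized form, e.g. "User-agent":
print(req.get_header("User-agent"))
```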
Upvotes: 4