Reputation: 2096
I found that you can't read from some sites using Python's urllib2 (or urllib). An example:
import urllib2
urllib2.urlopen("http://www.dafont.com/").read()
# Returns ''
These sites work when you visit them with a browser. I can even scrape them using PHP (I didn't try other languages). I have seen other sites with the same issue, but I can't remember the URLs at the moment.
My questions are...
Upvotes: 2
Views: 1928
Reputation: 20470
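I believe the request gets blocked because of its User-Agent header. By default urllib2 identifies itself as Python-urllib/2.x, which some servers reject. You can change the User-Agent using the following sample code:
import urllib2

USERAGENT = 'something'  # replace with the User-Agent string you want to send
HEADERS = {'User-Agent': USERAGENT}
# Attach the custom header to the request instead of passing a bare URL
req = urllib2.Request(URL_HERE, headers=HEADERS)
f = urllib2.urlopen(req)
s = f.read()
f.close()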
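For readers on Python 3, where urllib2 became urllib.request, the same idea can be sketched roughly as below. The `BROWSER_UA` string and the helper names `build_request` and `fetch` are made up for illustration and are not part of any library. Whether a particular site accepts a given User-Agent is an assumption you would need to check yourself.

```python
import urllib.request

# Hypothetical browser-like User-Agent string, chosen for illustration only.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64) ExampleFetcher/1.0"


def build_request(url, user_agent=BROWSER_UA):
    """Create a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA, timeout=10):
    """Send the request and return the raw response body as bytes."""
    req = build_request(url, user_agent)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch("http://www.dafont.com/")` would then send the custom header instead of the default Python-urllib one. Keeping `build_request` separate makes it easy to inspect the headers without touching the network.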
Upvotes: 6
Reputation: 2096
I'm the guy who posted the question. I have some suspicions, but I'm not sure about them; that's why I posted the question here.
I think it's due to the host blocking the urllib library using robots.txt or .htaccess. But I'm not sure about it. I'm not even sure if it's possible.
If you are on Unix, this will work...
import commands
contents = commands.getoutput("curl -s '" + url + "'")
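A variation on the curl approach, sketched here with the subprocess module instead of commands (which exists only in Python 2): passing the arguments as a list avoids building a shell string by hand, so quotes in the URL cannot break the command. The helper names `build_curl_command` and `fetch_with_curl` are hypothetical, and the sketch assumes curl is installed and on the PATH.

```python
import subprocess


def build_curl_command(url):
    """Return the argument list for a silent curl fetch of the given URL."""
    return ["curl", "-s", url]


def fetch_with_curl(url):
    """Run curl without a shell and return its standard output as bytes."""
    return subprocess.check_output(build_curl_command(url))
```

With this in place, `contents = fetch_with_curl(url)` plays the same role as the one-liner above.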
Upvotes: 0