Reputation: 3
I'm trying to download images from a list of URLs I've been given. Most of the URLs either download successfully or return a forbidden error. However, one particular URL opens fine in a browser, and my code doesn't raise an error when it tries to download it; it just hangs and runs forever. Is this a problem with urllib, my code, or the link itself, and is there a way around it?
import urllib.request
urllib.request.urlretrieve("http://www.mercedsunstar.com/news/9d6aao/picture82035257/alternates/FREE_640/13330875_1110997995625119_2134033517544198418_n", "test_image.jpg")
Upvotes: 0
Views: 171
Reputation: 1032
This specific site checks for the User-Agent and other headers that browsers usually send. If these are not present, it won't answer the request at all, so your code never returns. This mechanism is sometimes used to prevent automated crawling of images or other content, which is probably what you are trying to do.
You could look into the build_opener() and install_opener() functions of urllib.request to create an opener instance and modify its addheaders attribute before calling urlretrieve.
import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0")]
urllib.request.install_opener(opener)
After that the code from your question should work as is.
urllib.request.urlretrieve("http://www.mercedsunstar.com/news/9d6aao/picture82035257/alternates/FREE_640/13330875_1110997995625119_2134033517544198418_n", "test_image.jpg")
If you're really crawling the web, I would suggest looking into frameworks specifically designed for that, e.g. Scrapy. It offers many convenience features that will probably make what you're trying to achieve a lot easier than building everything from scratch.
Also be advised that they might employ this mechanism for a reason, so make sure you are not infringing on their intellectual property rights.
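On that note, checking the site's robots.txt before crawling is a reasonable first step, and the standard library can do it with urllib.robotparser. A minimal sketch (the inline rules and URLs here are illustrative assumptions, not the real site's policy; in practice you would call set_url() and read() against the live robots.txt):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Normally: rp.set_url("https://example.com/robots.txt"); rp.read()
# Here we parse example rules inline for illustration.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyBot", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyBot", "https://example.com/private/page"))  # False
```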
Upvotes: 3
Reputation: 142631
This page checks the 'User-Agent' header to recognize web browsers and blocks scripts and bots. urllib sends a User-Agent like "Python-urllib/3.x", so the server blocks it.
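You can see this default for yourself: the opener returned by build_opener() ships with urllib's standard headers, and the first entry is the User-Agent the server receives (the exact version suffix depends on your Python):

```python
import urllib.request

# build_opener() starts with urllib's default headers; the first entry
# is the User-Agent the server sees, e.g. ('User-agent', 'Python-urllib/3.11')
opener = urllib.request.build_opener()
print(opener.addheaders)
```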
This code works for me:
import urllib.request
req = urllib.request.Request('http://www.mercedsunstar.com/news/9d6aao/picture82035257/alternates/FREE_640/13330875_1110997995625119_2134033517544198418_n')
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64; rv:69.0) Gecko/20100101 Firefox/69.0')
content = urllib.request.urlopen(req).read()
with open("test_image.jpg", 'wb') as f:
    f.write(content)
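For larger files, an alternative to read() is streaming the response to disk so the whole image never sits in memory at once; shutil.copyfileobj works with any readable file-like object. In this sketch a BytesIO object stands in for the real urlopen(req) response:

```python
import io
import shutil

# Stand-in for urllib.request.urlopen(req): any readable file-like object works.
src = io.BytesIO(b"\xff\xd8\xff\xe0" + b"\x00" * 4096)  # fake JPEG header + padding

with open("test_image.jpg", "wb") as dst:
    shutil.copyfileobj(src, dst)  # copies in chunks instead of one big read()
```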
Upvotes: 0