Reputation: 7916
Why am I receiving this error?
[scrapy] WARNING: File (code: 302): Error downloading file from <GET <url> referred in <None>
The URL downloads without any problems in my browser, and a 302 is simply a redirect. Why doesn't Scrapy follow the redirect and download the file?
process = CrawlerProcess({
    'FILES_STORE': 'C:\\Users\\User\\Downloads\\Scrapy',
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
})
process.crawl(MySpider)
process.start()  # the script will block here until the crawling is finished
Upvotes: 2
Views: 2769
Reputation: 861
If redirection is the problem, you should add the following to your settings.py:
MEDIA_ALLOW_REDIRECTS = True
Source: Allowing redirections in Scrapy
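Since the question builds its settings inline rather than in settings.py, the same fix can go straight into the dict passed to CrawlerProcess; a minimal sketch of that variant:

```python
# Settings dict for CrawlerProcess(settings); MEDIA_ALLOW_REDIRECTS tells
# the media pipelines (FilesPipeline included) to follow 3xx responses.
settings = {
    'FILES_STORE': 'C:\\Users\\User\\Downloads\\Scrapy',
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
    'MEDIA_ALLOW_REDIRECTS': True,  # the fix: allow 302s during file downloads
}
```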
Upvotes: 4
Reputation: 7130
My solution is to use requests to send an HTTP HEAD request first and, based on the status_code, choose which URL to download. Then you can put that URL in file_urls or your custom field.
import requests

def check_redirect(url):
    # HEAD fetches only the headers; if the server answers 302,
    # swap in the redirect target from the Location header.
    response = requests.head(url)
    if response.status_code == 302:
        url = response.headers["Location"]
    return url
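The redirect-resolution step itself is pure logic that can be separated from the network call, which makes it easy to reason about; a minimal sketch (the helper name `resolve_redirect` is mine, not from the answer):

```python
def resolve_redirect(url, status_code, headers):
    """Return the Location target for a 302, otherwise the original URL."""
    if status_code == 302:
        return headers.get("Location", url)
    return url

# A 302 response points the download at the Location header:
print(resolve_redirect("http://example.com/file", 302,
                       {"Location": "http://cdn.example.com/file"}))
# -> http://cdn.example.com/file
```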
Or maybe you can use a custom FilesPipeline:
import requests
import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.files import FilesPipeline

class MyFilesPipeline(FilesPipeline):
    def handle_redirect(self, file_url):
        # Resolve a single 302 manually before the pipeline downloads the file.
        response = requests.head(file_url)
        if response.status_code == 302:
            file_url = response.headers["Location"]
        return file_url

    def get_media_requests(self, item, info):
        redirect_url = self.handle_redirect(item["file_urls"][0])
        yield scrapy.Request(redirect_url)

    def item_completed(self, results, item, info):
        file_paths = [x['path'] for ok, x in results if ok]
        if not file_paths:
            raise DropItem("Item contains no files")
        item['file_urls'] = file_paths
        return item
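To make Scrapy use this pipeline instead of the stock FilesPipeline, point ITEM_PIPELINES at it in the settings; the module path `myproject.pipelines` below is a placeholder for wherever you define the class:

```python
# In settings.py (or the dict passed to CrawlerProcess):
# replace the stock FilesPipeline entry with the custom subclass.
ITEM_PIPELINES = {'myproject.pipelines.MyFilesPipeline': 1}
```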
I used another solution here: Scrapy i/o block when downloading files
Upvotes: 4
Reputation: 7916
The root of the problem seems to be this code in pipelines/media.py:
def _check_media_to_download(self, result, request, info):
    if result is not None:
        return result
    if self.download_func:
        # this ugly code was left only to support tests. TODO: remove
        dfd = mustbe_deferred(self.download_func, request, info.spider)
        dfd.addCallbacks(
            callback=self.media_downloaded, callbackArgs=(request, info),
            errback=self.media_failed, errbackArgs=(request, info))
    else:
        request.meta['handle_httpstatus_all'] = True
        dfd = self.crawler.engine.download(request, info.spider)
        dfd.addCallbacks(
            callback=self.media_downloaded, callbackArgs=(request, info),
            errback=self.media_failed, errbackArgs=(request, info))
    return dfd
Specifically, the line that sets handle_httpstatus_all to True disables the redirect middleware for the download, which triggers the error. I will ask on the Scrapy GitHub for the reasons for this.
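The mechanism can be illustrated in isolation: Scrapy's RedirectMiddleware passes a response through untouched when certain meta flags opt the request out of redirect handling. A simplified sketch of that guard (the helper name `should_follow_redirect` is mine, and this paraphrases rather than copies the middleware source):

```python
def should_follow_redirect(meta, status):
    """Mimic the opt-out guard in RedirectMiddleware.process_response:
    any of these meta flags makes the middleware return the 3xx response
    unchanged instead of issuing a new request to the Location URL."""
    if meta.get('dont_redirect', False):
        return False
    if meta.get('handle_httpstatus_all', False):
        return False
    if status in meta.get('handle_httpstatus_list', []):
        return False
    return True

# The media pipeline sets handle_httpstatus_all, so its 302 is never followed:
print(should_follow_redirect({'handle_httpstatus_all': True}, 302))  # False
print(should_follow_redirect({}, 302))  # True
```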
Upvotes: 3