Reputation: 54130
Using Scrapy, I want to download files and save them under different filenames.
With the default files pipeline enabled, the files (which may be HTML or PDF) download perfectly fine.
To rename them, I wrote the following class and overrode the file_path method.
import re

from scrapy.pipelines.files import FilesPipeline


class MyCustomFilePipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # extract 191148 from http://mywebsite.com/filedownload.asp?pn=191148&yr=2022
        pn = re.search(r'(?<=pn\=)\d+', request.url).group()
        print(f'{request.url} - {pn}')
        print(type(response))  # <-- this prints as <class 'NoneType'>
        response_contentype = response.headers['Content-Type'].decode('ASCII')
        ext = 'html'
        if response_contentype == 'text/html':
            ext = 'html'
        elif response_contentype == 'application/pdf':
            ext = 'pdf'
        print(f'{pn}.{ext}')  # <-- this is not printed
        return f'{pn}.{ext}'
I enabled it in settings.py. In the console, for each request URL, I get the output of the first two print statements (added to the code above for debugging).
But the type of response is <class 'NoneType'>.
Surprisingly, print(f'{pn}.{ext}') never prints anything.
No files are being downloaded, and the files field is not populated.
Why isn't Scrapy making requests and getting responses? What am I missing?
Upvotes: 0
Views: 207
Reputation: 1106
print(f'{pn}.{ext}') is never reached because response is None, so accessing response.headers raises an AttributeError, which stops the rest of the method. One way to make your code work, although I don't know if it's the best solution, is to wrap your code in if response:
import re

from scrapy.pipelines.files import FilesPipeline


class MyCustomFilePipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # only read the headers when a response was actually passed in;
        # otherwise the method falls through and returns None
        if response:
            # extract 191148 from http://mywebsite.com/filedownload.asp?pn=191148&yr=2022
            pn = re.search(r"(?<=pn\=)\d+", request.url).group()
            print(f"{request.url} - {pn}")
            print(type(response))  # <-- no longer NoneType inside this branch
            response_contentype = response.headers["Content-Type"].decode("ASCII")
            ext = "html"
            if response_contentype == "text/html":
                ext = "html"
            elif response_contentype == "application/pdf":
                ext = "pdf"
            print(f"{pn}.{ext}")  # <-- this works now
            return f"{pn}.{ext}"
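If you would rather return a usable name on every call, including the ones where response is None, here is a minimal sketch of the naming logic pulled out into a plain function. The names build_filename, EXTENSIONS and default_ext are my own and not part of Scrapy, and I am assuming the pn query parameter is always present, as in the example URL. It also ignores anything after a semicolon in the header, so a value like "text/html; charset=utf-8" still matches.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical mapping from MIME type to file extension (not a Scrapy setting).
EXTENSIONS = {"text/html": "html", "application/pdf": "pdf"}


def build_filename(url, content_type=None, default_ext="html"):
    """Return '<pn>.<ext>' for a URL such as ...filedownload.asp?pn=191148&yr=2022."""
    pn_values = parse_qs(urlparse(url).query).get("pn")
    if not pn_values:
        raise ValueError(f"no 'pn' parameter in {url!r}")
    pn = pn_values[0]

    # No header available (e.g. no response was passed): use the default.
    if content_type is None:
        return f"{pn}.{default_ext}"

    # Header values may arrive as bytes; decode before comparing.
    if isinstance(content_type, bytes):
        content_type = content_type.decode("ascii", errors="replace")

    # Drop parameters such as '; charset=utf-8' and normalise the case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return f"{pn}.{EXTENSIONS.get(mime, default_ext)}"
```

Inside file_path you could then call something like build_filename(request.url, response.headers.get("Content-Type") if response is not None else None). Keep in mind that the name computed without a response is only a guess at the extension, so it can differ from the name computed once the headers are known.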
Upvotes: 1