claws

Reputation: 54130

Scrapy is sending response of type None to my custom file pipeline

Using Scrapy, I want to download files and save them under different filenames.

First of all, with the default files pipeline enabled, the files (which may be HTML or PDF) download perfectly fine.

To rename them, I wrote the following class and overrode the file_path method.

class MyCustomFilePipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):        
        # extract 191148 from http://mywebsite.com/filedownload.asp?pn=191148&yr=2022
        pn = re.search(r'(?<=pn\=)\d+', request.url).group()
        print(f'{request.url} - {pn}')
        print(type(response)) # <-- this prints as <class 'NoneType'>
        
        response_contentype = response.headers['Content-Type'].decode('ASCII')
        ext = 'html'
        if response_contentype  == 'text/html':
            ext = 'html'
        elif response_contentype == 'application/pdf':
            ext = 'pdf'
        print(f'{pn}.{ext}') # <-- this is not printed  
        return f'{pn}.{ext}'

I enabled it in settings.py. In the console, for each request URL, I get the output of the first two print statements (added to the code above for debugging).

But the response is <class 'NoneType'>.

Surprisingly, print(f'{pn}.{ext}') isn't being printed.

No files are being downloaded, and the files field is not populated.

Why isn't Scrapy making requests and getting responses? What am I missing?

Upvotes: 0

Views: 207

Answers (1)

zaki98

Reputation: 1106

  1. You are not seeing the output of print(f'{pn}.{ext}') because response is None, so response.headers raises an error that stops the rest of the method from running.
  2. Why response is None: if you check the code for FilesPipeline, the file_path method is executed twice: once in media_to_download without a response (that's why you were getting None) and once in media_downloaded with the response.
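Point 1 can be reproduced without Scrapy. The following is only a minimal sketch: file_path_like is a hypothetical stand-in for the header lookup in file_path, not part of Scrapy. It shows that attribute access on None raises AttributeError, so nothing after that line runs.

```python
def file_path_like(response=None):
    """Hypothetical stand-in that mirrors the header lookup in file_path."""
    steps = ["before header lookup"]
    # With response=None this line raises AttributeError,
    # so the append below is never reached.
    content_type = response.headers["Content-Type"]
    steps.append("after header lookup")
    return steps


try:
    file_path_like(None)
    outcome = "completed"
except AttributeError as exc:
    outcome = f"AttributeError: {exc}"

print(outcome)
```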

One way to make your code work, although I don't know if it's the best solution, is to wrap your code in an if response check:

import re

from scrapy.pipelines.files import FilesPipeline


class MyCustomFilePipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        # The call from media_to_download passes no response; in that case
        # this method skips the block below and implicitly returns None.
        if response:
            # extract 191148 from http://mywebsite.com/filedownload.asp?pn=191148&yr=2022
            pn = re.search(r"(?<=pn\=)\d+", request.url).group()
            print(f"{request.url} - {pn}")
            print(type(response))  # <-- no longer NoneType inside this branch
            response_contentype = response.headers["Content-Type"].decode("ASCII")
            ext = "html"
            if response_contentype == "text/html":
                ext = "html"
            elif response_contentype == "application/pdf":
                ext = "pdf"
            print(f"{pn}.{ext}")  # <-- this works now
            return f"{pn}.{ext}"
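Since the version above returns None on the call without a response, a possible variation is to move the naming logic into a small function that always returns a name. This is only a sketch under assumptions: build_file_name and the EXTENSIONS mapping are hypothetical names, not part of Scrapy, and falling back to "html" when no content type is known simply mirrors the default in the code above.

```python
import re

# Assumed mapping; extend it for other content types you expect.
EXTENSIONS = {"text/html": "html", "application/pdf": "pdf"}


def build_file_name(url, content_type=None):
    """Return '<pn>.<ext>' for a download URL (hypothetical helper)."""
    match = re.search(r"(?<=pn=)\d+", url)
    if match is None:
        raise ValueError(f"no pn parameter in {url!r}")
    # Drop parameters such as '; charset=utf-8' before the lookup.
    mime = (content_type or "").split(";")[0].strip().lower()
    # Unknown or missing content types fall back to 'html'.
    return f"{match.group()}.{EXTENSIONS.get(mime, 'html')}"


url = "http://mywebsite.com/filedownload.asp?pn=191148&yr=2022"
print(build_file_name(url))                     # no response yet
print(build_file_name(url, "application/pdf"))  # response available
```

Inside file_path, you would pass the decoded Content-Type header when response is not None, and None otherwise. Note that the two calls can then produce different names for the same URL (for example 191148.html before the download and 191148.pdf after it), so check whether that matters for your setup.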

Upvotes: 1
