Vinod kumar

Reputation: 87

Need to download all .pdf files from a given URL using Scrapy

**I tried to run this Scrapy spider to download all the related PDFs from the given URL.**

I tried to execute it using "scrapy crawl mySpider":

import urlparse
import scrapy

from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "sec_gov"

    allowed_domains = ["www.sec.gov"]
    start_urls = ["https://secsearch.sec.gov/search?utf8=%3F&affiliate=secsearch&query=exhibit+10"]

    def parse(self, response):
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)

Can anyone help me with this? Thanks in advance.

Upvotes: 0

Views: 3416

Answers (2)

nilansh bansal

Reputation: 1494

Flaws in the code:

1. http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html redirects to https://www.pwc.com/us/en/services/tax/library.html.

2. There is no div with the id all_results, so div#all_results matches nothing in the HTML response returned to the crawler. A non-matching CSS selector does not raise an error; it simply returns an empty list, so the loop in the parse method yields no requests at all. You can confirm this from the Scrapy shell, as sketched below.

3. For the scrapy crawl command to work, you must run it from a directory where the configuration file scrapy.cfg exists.
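A quick way to check the second flaw is an interactive Scrapy shell session; a sketch (output abbreviated, exact details depend on your Scrapy version):

$ scrapy shell "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
>>> response.url                     # note the redirect to the new URL
'https://www.pwc.com/us/en/services/tax/library.html'
>>> response.css('div#all_results')  # the selector from the question
[]

An empty SelectorList means parse() never enters its loop, so no PDF requests are ever scheduled.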

Edit: I hope this code helps you. It downloads all the PDFs from the given link.

Code:

import scrapy

from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["https://www.pwc.com/us/en/services/consulting/analytics/benchmarking-services.html"]

    def parse(self, response):
        # Note: the urlparse/urllib import from the question is not needed.
        base_url = 'https://www.pwc.com'

        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()

            if link.endswith('.pdf'):
                # The PDF hrefs on this page are site-relative, so prepending
                # the site root is enough; urllib.parse.urljoin is not required.
                link = base_url + link
                self.logger.info(link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        # Save each PDF under its original file name in the current directory.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)

The code repository can be found at: https://github.com/NilanshBansal/File_download_Scrapy
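Scrapy also ships a built-in FilesPipeline for exactly this download-and-save pattern; a minimal sketch, where the spider name and the FILES_STORE directory are placeholder values:

import scrapy

class PdfSpider(scrapy.Spider):
    name = "pwc_pdf"  # placeholder name
    allowed_domains = ["www.pwc.com"]
    start_urls = ["https://www.pwc.com/us/en/services/consulting/analytics/benchmarking-services.html"]
    custom_settings = {
        # Enable the built-in pipeline and tell it where to store downloads.
        "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
        "FILES_STORE": "pdfs",  # placeholder output directory
    }

    def parse(self, response):
        for href in response.xpath('//a[@href]/@href').extract():
            if href.endswith('.pdf'):
                # FilesPipeline downloads every URL listed in file_urls.
                yield {"file_urls": [response.urljoin(href)]}

One caveat: FilesPipeline names each saved file by a hash of its URL rather than by the original file name, unlike the save_pdf method above.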

Upvotes: 1

HariUserX

Reputation: 1334

You should run the command inside the directory where scrapy.cfg is present.
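Alternatively, if you do not want to depend on a project layout at all, the spider can be driven from a plain Python script with CrawlerProcess; a minimal sketch (spider name and start URL are placeholders):

import scrapy
from scrapy.crawler import CrawlerProcess

class PdfSpider(scrapy.Spider):
    name = "pdf_spider"  # placeholder name
    start_urls = ["https://www.pwc.com/us/en/services/consulting/analytics/benchmarking-services.html"]

    def parse(self, response):
        for href in response.xpath('//a[@href]/@href').extract():
            if href.endswith('.pdf'):
                yield scrapy.Request(response.urljoin(href), callback=self.save_pdf)

    def save_pdf(self, response):
        # Save the PDF under its original file name.
        with open(response.url.split('/')[-1], 'wb') as f:
            f.write(response.body)

# CrawlerProcess does not need scrapy.cfg, so this script can live anywhere.
process = CrawlerProcess()
process.crawl(PdfSpider)
process.start()  # blocks until the crawl finishes

Run it with a plain python invocation; no scrapy.cfg or scrapy crawl command is required.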

Upvotes: 0
