Vinod kumar

Reputation: 87

Need to download all .pdf files from a given URL using Scrapy

**I tried to run this Scrapy spider to download all the related PDFs from the given URL.**

I tried to execute it using "scrapy crawl mySpider":

import urlparse
import scrapy

from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "sec_gov"

    allowed_domains = ["www.sec.gov"]
    start_urls = ["https://secsearch.sec.gov/search?utf8=%3F&affiliate=secsearch&query=exhibit+10"]

    def parse(self, response):
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)

Can anyone help me with this? Thanks in advance.

Upvotes: 0

Views: 3416

Answers (2)

nilansh bansal

Reputation: 1494

Flaws in the code:

1. http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html redirects to https://www.pwc.com/us/en/services/tax/library.html.

2. There is no div with the id all_results, so div#all_results matches nothing in the HTML response returned to the crawler. A non-matching CSS selector does not raise an error; it simply returns an empty list, so the loop in the parse method yields no requests at all. You can confirm this from the Scrapy shell, as sketched below.

3. For the scrapy crawl command to work, you must run it from a directory where the configuration file scrapy.cfg exists.
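A quick way to check the second flaw is an interactive Scrapy shell session; a sketch (output abbreviated, exact details depend on your Scrapy version):

$ scrapy shell "http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html"
>>> response.url                     # note the redirect to the new URL
'https://www.pwc.com/us/en/services/tax/library.html'
>>> response.css('div#all_results')  # the selector from the question
[]

An empty SelectorList means parse() never enters its loop, so no PDF requests are ever scheduled.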

Edit: I hope this code helps you. It downloads all the PDFs from the given link.

Code:

import scrapy

from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "pwc_tax"

    allowed_domains = ["www.pwc.com"]
    start_urls = ["https://www.pwc.com/us/en/services/consulting/analytics/benchmarking-services.html"]

    def parse(self, response):
        # Note: the urlparse/urllib import from the question is not needed.
        base_url = 'https://www.pwc.com'

        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()

            if link.endswith('.pdf'):
                # The PDF hrefs on this page are site-relative, so prepending
                # the site root is enough; urllib.parse.urljoin is not required.
                link = base_url + link
                self.logger.info(link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        # Save each PDF under its original file name in the current directory.
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)

The code repository can be found at: https://github.com/NilanshBansal/File_download_Scrapy
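Scrapy also ships a built-in FilesPipeline for exactly this download-and-save pattern; a minimal sketch, where the spider name and the FILES_STORE directory are placeholder values:

import scrapy

class PdfSpider(scrapy.Spider):
    name = "pwc_pdf"  # placeholder name
    allowed_domains = ["www.pwc.com"]
    start_urls = ["https://www.pwc.com/us/en/services/consulting/analytics/benchmarking-services.html"]
    custom_settings = {
        # Enable the built-in pipeline and tell it where to store downloads.
        "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
        "FILES_STORE": "pdfs",  # placeholder output directory
    }

    def parse(self, response):
        for href in response.xpath('//a[@href]/@href').extract():
            if href.endswith('.pdf'):
                # FilesPipeline downloads every URL listed in file_urls.
                yield {"file_urls": [response.urljoin(href)]}

One caveat: FilesPipeline names each saved file by a hash of its URL rather than by the original file name, unlike the save_pdf method above.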

Upvotes: 1

HariUserX

Reputation: 1334

You should run the command inside the directory where scrapy.cfg is present.
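Alternatively, if you do not want to depend on a project layout at all, the spider can be driven from a plain Python script with CrawlerProcess; a minimal sketch (spider name and start URL are placeholders):

import scrapy
from scrapy.crawler import CrawlerProcess

class PdfSpider(scrapy.Spider):
    name = "pdf_spider"  # placeholder name
    start_urls = ["https://www.pwc.com/us/en/services/consulting/analytics/benchmarking-services.html"]

    def parse(self, response):
        for href in response.xpath('//a[@href]/@href').extract():
            if href.endswith('.pdf'):
                yield scrapy.Request(response.urljoin(href), callback=self.save_pdf)

    def save_pdf(self, response):
        # Save the PDF under its original file name.
        with open(response.url.split('/')[-1], 'wb') as f:
            f.write(response.body)

# CrawlerProcess does not need scrapy.cfg, so this script can live anywhere.
process = CrawlerProcess()
process.crawl(PdfSpider)
process.start()  # blocks until the crawl finishes

Run it with a plain python invocation; no scrapy.cfg or scrapy crawl command is required.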

Upvotes: 0
