Reputation: 87
**I tried to run this Scrapy spider to download all the related PDFs from a given URL**
I tried to execute this using `scrapy crawl mySpider`:
```python
import urlparse
import scrapy
from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "sec_gov"
    allowed_domains = ["www.sec.gov"]
    start_urls = ["https://secsearch.sec.gov/search?utf8=%3F&affiliate=secsearch&query=exhibit+10"]

    def parse(self, response):
        for href in response.css('div#all_results h3 a::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.parse_article
            )

    def parse_article(self, response):
        for href in response.css('div.download_wrapper a[href$=".pdf"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
```
Can anyone help me with this? Thanks in advance.
Upvotes: 0
Views: 3416
Reputation: 1494
Flaws in the code:

- The URL http://www.pwc.com/us/en/tax-services/publications/research-and-insights.html redirects to https://www.pwc.com/us/en/services/tax/library.html.
- There is no div with the id all_results, so the `div#all_results` selector matches nothing in the HTML response returned to the crawler, and the loop in the parse method never yields a single request (see the sketch below).
- For the `scrapy crawl` command to work, you must run it from a directory where the configuration file scrapy.cfg exists.
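To see the second point concretely, here is a minimal sketch (the HTML string is made up for illustration) showing that a non-matching CSS selector returns an empty list rather than raising an error:

```python
from scrapy.selector import Selector

# Hypothetical response body that contains no div#all_results anywhere
html = '<html><body><div id="other"><h3><a href="/doc.pdf">doc</a></h3></div></body></html>'
sel = Selector(text=html)

# The selector silently matches nothing; no exception is raised
links = sel.css('div#all_results h3 a::attr(href)').extract()
print(links)  # [] -> the for loop in parse iterates zero times and yields no requests
```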
Edit: I hope this code helps you. It downloads all the PDFs linked from the given page.
Code:

```python
# import urllib  # not needed; the absolute URL is built by plain concatenation below
import scrapy
from scrapy.http import Request

class pwc_tax(scrapy.Spider):
    name = "pwc_tax"
    allowed_domains = ["www.pwc.com"]
    start_urls = ["https://www.pwc.com/us/en/services/consulting/analytics/benchmarking-services.html"]

    def parse(self, response):
        base_url = 'https://www.pwc.com'
        for a in response.xpath('//a[@href]/@href'):
            link = a.extract()
            if link.endswith('.pdf'):
                # link = urllib.parse.urljoin(base_url, link)  # replaced with concatenation
                link = base_url + link  # the PDF hrefs on this page are site-relative paths
                self.logger.info(link)
                yield Request(link, callback=self.save_pdf)

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving PDF %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
```
The code repository can be found at: https://github.com/NilanshBansal/File_download_Scrapy
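As an alternative to writing the files by hand in save_pdf, Scrapy's built-in FilesPipeline can manage the downloads for you. A minimal sketch, assuming the same page; the spider name, item class, and the FILES_STORE directory are choices made up for this example:

```python
import scrapy

class PdfItem(scrapy.Item):
    # FilesPipeline reads URLs from 'file_urls' and records results in 'files'
    file_urls = scrapy.Field()
    files = scrapy.Field()

class pwc_tax_pipeline(scrapy.Spider):
    name = "pwc_tax_pipeline"
    allowed_domains = ["www.pwc.com"]
    start_urls = ["https://www.pwc.com/us/en/services/consulting/analytics/benchmarking-services.html"]
    custom_settings = {
        "ITEM_PIPELINES": {"scrapy.pipelines.files.FilesPipeline": 1},
        "FILES_STORE": "downloaded_pdfs",  # directory where the pipeline stores files
    }

    def parse(self, response):
        # urljoin handles both relative and absolute hrefs
        pdf_links = [
            response.urljoin(href)
            for href in response.xpath('//a[@href]/@href').extract()
            if href.endswith('.pdf')
        ]
        yield PdfItem(file_urls=pdf_links)
```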
Upvotes: 1
Reputation: 1334
You should run the command inside the directory where scrapy.cfg is present.
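If you would rather not depend on scrapy.cfg at all, Scrapy's CrawlerProcess can run a spider as a plain script. A minimal sketch, assuming the pwc_tax spider class from the answer above is importable (the module name my_spider is made up for illustration):

```python
from scrapy.crawler import CrawlerProcess
from my_spider import pwc_tax  # hypothetical module containing the spider above

process = CrawlerProcess(settings={
    "USER_AGENT": "Mozilla/5.0 (compatible; pdf-downloader)",
})
process.crawl(pwc_tax)
process.start()  # blocks until the crawl finishes
```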
Upvotes: 0