Reputation: 106
I'm trying to scrape all the CSVs from this site: transparentnevada.com
When you navigate to a specific agency, e.g. http://transparentnevada.com/salaries/2016/university-nevada-reno/, and hit Download Records, there is a link to a number of CSVs. I'd like to download all of them.
My spider runs and appears to crawl all the records but isn't downloading anything:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

class Spider2(CrawlSpider):
    # name of the spider
    name = 'nevada'

    # list of allowed domains
    allowed_domains = ['transparentnevada.com']

    # starting url for scraping
    start_urls = ['http://transparentnevada.com/salaries/all/']

    rules = [
        Rule(LinkExtractor(
            allow=['/salaries/all/*']),
            follow=True),
        Rule(LinkExtractor(
            allow=['/salaries/2016/*/']),
            follow=True),
        Rule(LinkExtractor(
            allow=['/salaries/2016/*/#']),
            callback='parse_article',
            follow=True),
    ]

    # setting the location of the output csv file
    custom_settings = {
        'FEED_FORMAT': "csv",
        'FEED_URI': 'tmp/nevada2.csv'
    }

    def parse_article(self, response):
        for href in response.css('div.view-downloads a[href$=".csv"]::attr(href)').extract():
            yield Request(
                url=response.urljoin(href),
                callback=self.save_pdf
            )

    def save_pdf(self, response):
        path = response.url.split('/')[-1]
        self.logger.info('Saving CSV %s', path)
        with open(path, 'wb') as f:
            f.write(response.body)
Upvotes: 1
Views: 105
Reputation: 146610
The issue is that the CSVs live under /export/, and your rules do nothing about those URLs. I added a simple LinkExtractor rule to your spider and it started downloading the files:
Rule(LinkExtractor(
    allow=['/export/.*\.csv']),
    callback='save_pdf',
    follow=True),
Also, your rules above are not 100% correct: you have used "/*" where it should be "/.*/". In a regex, "/*" means zero or more slashes, so it also matches an empty string or something like "////", whereas "/.*/" matches a slash, any characters, and then another slash. So fix your rules, add the rule I gave, and that should get the work done.
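To see the difference between the two patterns, here is a small standalone check with Python's re module (LinkExtractor's allow patterns are plain regexes, so the behaviour carries over); the URLs are just examples taken from the question:

```python
import re

# '/*' matches ZERO or more slashes, so the agency segment is not required
loose = re.compile(r'/salaries/2016/*/')
# '/.*/' requires slash, any characters, slash -- i.e. an actual path segment
strict = re.compile(r'/salaries/2016/.*/')

url = 'http://transparentnevada.com/salaries/2016/university-nevada-reno/'
bare = 'http://transparentnevada.com/salaries/2016/'

print(bool(loose.search(url)))    # True
print(bool(strict.search(url)))   # True

# The difference shows on a URL with no agency segment:
print(bool(loose.search(bare)))   # True  -- '/*' matches the bare path too
print(bool(strict.search(bare)))  # False -- no segment between the slashes
```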
Upvotes: 2